A parallel array processor for massively parallel applications is formed with low power CMOS with DRAM
processing while incorporating processing elements on a single chip. Eight processors on a single chip have 
their own associated processing element, significant memory, and I/O and are interconnected with a hypercube 
based, but modified, topology. These nodes are then interconnected, either by a hypercube, modified hypercube,
or ring, or ring within ring network topology. Conventional microprocessor MPPs consume pins and time
going to memory. The new architecture merges processor and memory with multiple PMEs (eight 16 bit 
processors with 32K and I/O) in DRAM and has no memory access delays and uses all the pins for networking. 
The chip can be a single node of a fine-grained parallel processor. Each chip will have eight 16 bit processors, 
each processor providing 5 MIPs performance. I/O has three internal ports and one external port shared by the 
plural processors on the chip. Significant software flexibility is provided to enable quick implementation of 
existing programs written in common languages. It is a developable and expandable technology without need to 
develop new pinouts, new software, or new utilities as chip density increases and new hardware is provided for a 
chip function. The scalable chip PME has internal and external connections for broadcast and asynchronous 
SIMD, MIMD and SIMIMD (SIMD/MIMD) with dynamic switching of modes. The chip can be used in systems 
which employ 32, 64 or 128,000 processors, and can be used for lower, intermediate and higher ranges. Local
and global memory functions can all be provided by the chips themselves, and the system can connect to and
support other global memories and DASD. The chip can be used as a microprocessor accelerator, in personal
computer applications, as a vision or avionics computer system, or as workstation or supercomputer. There is
program compatibility for the fully scalable system.

FIELD OF THE INVENTIONS

The invention relates to computer and computer systems and particularly to parallel array processors.
In accordance with the invention, a parallel array processor (APAP) may be incorporated on a single
semiconductor silicon chip. This chip forms a basis for the systems described which are capable of
massively parallel processing of complex scientific and business applications.

REFERENCES USED IN THE DISCUSSION OF THE INVENTIONS 

In the detailed discussion of the invention, other works will be referenced, including references to our
own unpublished works which are not Prior Art, which will aid the reader in following the discussion.

GLOSSARY OF TERMS 

o ALU

ALU is the arithmetic logic unit portion of a processor.

o Array

Array refers to an arrangement of elements in one or more dimensions. An array can include an
ordered set of data items (array element) which in languages like Fortran are identified by a single
name. In other languages such a name of an ordered set of data items refers to an ordered collection
or set of data elements, all of which have identical attributes. A program array has dimensions
specified, generally by a number or dimension attribute. The declarator of the array may also specify
the size of each dimension of the array in some languages. In some languages, an array is an
arrangement of elements in a table. In a hardware sense, an array is a collection of structures
(functional elements) which are generally identical in a massively parallel architecture. Array elements
in data parallel computing are elements which can be assigned operations and when parallel can each
independently and in parallel execute the operations required. Generally, arrays may be thought of as
grids of processing elements. Sections of the array may be assigned sectional data, so that sectional
data can be moved around in a regular grid pattern. However, data can be indexed or assigned to an
arbitrary location in an array.

o Array Director

An Array Director is a unit programmed as a controller for an array. It performs the function of a
master controller for a grouping of functional elements arranged in an array.

o Array Processor

There are two principal types of array processors: multiple instruction multiple data (MIMD) and single
instruction multiple data (SIMD). In a MIMD array processor, each processing element in the array
executes its own unique instruction stream with its own data. In a SIMD array processor, each
processing element in the array is restricted to the same instruction via a common instruction stream;
however, the data associated with each processing element is unique. Our preferred array processor
has other characteristics. We call it Advanced Parallel Array Processor, and use the acronym APAP.
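
The distinction between the two types can be sketched in a few lines of Python. This is only an
illustration of the definitions above, not a model of the APAP; the function names simd_step and
mimd_step and the sample data are assumptions introduced for the example.

```python
from typing import Callable, List


def simd_step(instruction: Callable[[int], int], data: List[int]) -> List[int]:
    """SIMD: every processing element applies the same instruction to its own datum."""
    return [instruction(x) for x in data]


def mimd_step(instructions: List[Callable[[int], int]], data: List[int]) -> List[int]:
    """MIMD: each processing element applies its own instruction to its own datum."""
    if len(instructions) != len(data):
        raise ValueError("one instruction stream per processing element is required")
    return [op(x) for op, x in zip(instructions, data)]


data = [1, 2, 3, 4]  # one datum per (hypothetical) processing element
simd_result = simd_step(lambda x: x + 10, data)
mimd_result = mimd_step(
    [lambda x: x + 10, lambda x: x * 2, lambda x: -x, lambda x: x * x], data
)
print(simd_result)  # [11, 12, 13, 14]
print(mimd_result)  # [11, 4, -3, 16]
```

In both cases the data are unique per element; only the source of the instruction differs.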

o Asynchronous

Asynchronous is without a regular time relationship; the execution of a function is unpredictable with
respect to the execution of other functions which occur without a regular or predictable time
relationship to other function executions. In control situations, a controller will address a location to
which control is passed when data is waiting for an idle element being addressed. This permits
operations to remain in a sequence while they are out of time coincidence with any event.

o BOPS/GOPS

BOPS or GOPS are acronyms having the same meaning: billions of operations per second. See
GOPS.

o Circuit Switched/Store Forward

These terms refer to two mechanisms for moving data packets through a network of nodes. Store
Forward is a mechanism whereby a data packet is received by each intermediate node, stored into its
memory, and then forwarded on towards its destination. Circuit Switch is a mechanism whereby an
intermediate node is commanded to logically connect its input port to an output port such that data
packets can pass directly through the node towards their destination, without entering the intermediate
node's memory.
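
A minimal sketch of the difference, assuming a path given as a list of intermediate nodes followed by
the destination: the Node class, its memory_writes counter and the two functions are hypothetical
names introduced only to show where a packet does and does not enter node memory.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    memory_writes: int = 0                             # packets stored in this node's memory
    received: List[str] = field(default_factory=list)  # packets delivered to this node


def store_forward(packet: str, path: List[Node]) -> None:
    """Every intermediate node stores the packet in its memory before forwarding it."""
    *intermediates, destination = path
    for node in intermediates:
        node.memory_writes += 1
    destination.received.append(packet)


def circuit_switch(packet: str, path: List[Node]) -> None:
    """Intermediate nodes connect input port to output port; the packet bypasses their memory."""
    *_, destination = path
    destination.received.append(packet)


sf_path = [Node("A"), Node("B"), Node("C")]  # A and B are intermediate, C is the destination
cs_path = [Node("A"), Node("B"), Node("C")]
store_forward("pkt", sf_path)
circuit_switch("pkt", cs_path)
sf_writes = [n.memory_writes for n in sf_path]
cs_writes = [n.memory_writes for n in cs_path]
print(sf_writes)  # [1, 1, 0]
print(cs_writes)  # [0, 0, 0]
```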

o Cluster

A cluster is a station (or functional unit) which consists of a control unit (cluster controller) and the
hardware (which may be terminals, functional units, or virtual components) attached to it. Our Cluster
includes an array of PMEs sometimes called a Node array. Usually a cluster has 512 PMEs.

Our entire PME node array consists of a set of clusters, each cluster supported by a cluster
controller (CC).

o Cluster controller

A cluster controller is a device that controls input/output (I/O) operations for more than one device or
functional unit connected to it. A cluster controller is usually controlled by a program stored and
executed in the unit as it was in the IBM 3601 Finance Communication Controller, but it can be
entirely controlled by hardware as it was in the IBM 3272 Control Unit.

o Cluster synchronizer

A cluster synchronizer is a functional unit which manages the operations of all or part of a cluster to
maintain synchronous operation of the elements so that the functional units maintain a particular time
relationship with the execution of a program.

o Controller

A controller is a device that directs the transmission of data and instructions over the links of an
interconnection network; its operation is controlled by a program executed by a processor to which
the controller is connected or by a program executed within the device.

o CMOS

CMOS is an acronym for Complementary Metal-Oxide Semiconductor technology. It is commonly
used to manufacture dynamic random access memories (DRAMs). NMOS is another technology used
to manufacture DRAMs. We prefer CMOS but the technology used to manufacture the APAP is not
intended to limit the scope of the semiconductor technology which is employed.

o Dotting

Dotting refers to the joining of three or more leads by physically connecting them together. Most
backpanel busses share this connection approach. The term relates to OR DOTS of times past but is
used here to identify multiple data sources that can be combined onto a bus by a very simple
protocol.

Our I/O zipper concept can be used to implement the concept that the port into a node could be
driven by the port out of a node or by data coming from the system bus. Conversely, data being put
out of a node would be available to both the input to another node and to the system bus. Note that
outputting data to both the system bus and another node is not done simultaneously but in different
cycles.

Dotting is used in the H-DOT discussions where two-ported PEs or PMEs or Pickets can be used
in arrays of various organizations by taking advantage of dotting. Several topologies are discussed,
including 2D and 3D Meshes, Base 2 N-cube, Sparse Base 4 N-cube, and Sparse Base 8 N-cube.

o DRAM

DRAM is an acronym for dynamic random access memory, the common storage used by computers
for main memory. However, the term DRAM can be applied to use as a cache or as a memory which
is not the main memory.

o FLOATING-POINT

A floating-point number is expressed in two parts. There is a fixed point or fraction part, and an
exponent part to some assumed radix or base. The exponent indicates the actual placement of the
decimal point. In the typical floating-point representation a real number 0.0001234 is represented as
0.1234 -3, where 0.1234 is the fixed-point part and -3 is the exponent. In this example, the floating-point
radix or base is 10, where 10 represents the implicit fixed positive integer base, greater than
unity, that is raised to the power explicitly denoted by the exponent in the floating-point representation
or represented by the characteristic in the floating-point representation and then multiplied by the
fixed-point part to determine the real number represented. Numeric literals can be expressed in
floating-point notation as well as real numbers.
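
The worked number above can be reproduced with a short Python sketch. It assumes a base of 10 and a
fraction normalized to lie between 0.1 and 1; the function name to_fraction_exponent is introduced
only for this illustration and handles positive values only.

```python
from decimal import Decimal
from typing import Tuple


def to_fraction_exponent(value: str) -> Tuple[Decimal, int]:
    """Split a positive decimal literal into (fraction, exponent) for base 10,
    with 0.1 <= fraction < 1, so that value == fraction * 10**exponent."""
    number = Decimal(value)
    if number <= 0:
        raise ValueError("this sketch handles positive numbers only")
    ten = Decimal(10)
    exponent = 0
    while number >= 1:               # too large: shift the decimal point left
        number /= ten
        exponent += 1
    while number < Decimal("0.1"):   # too small: shift the decimal point right
        number *= ten
        exponent -= 1
    return number, exponent


fraction, exponent = to_fraction_exponent("0.0001234")
print(fraction, exponent)
# Multiplying the fraction by the base raised to the exponent recovers the real number.
restored = fraction * Decimal(10) ** exponent
```

Decimal arithmetic is used so that the base-10 shifts are exact rather than subject to binary rounding.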

o FLOPS

This term refers to floating-point instructions per second. Floating-point operations include ADD,
SUB, MPY, DIV and often many others. The floating-point instructions per second parameter is often
calculated using the add or multiply instructions and, in general, may be considered to have a 50/50
mix. An operation includes the generation of exponent, fraction and any required fraction normalization.
(We could address 32 or 48-bit floating-point formats, or longer, but we have not counted them in
the mix.) A floating-point operation when implemented with fixed point instructions (normal or RISC)
requires multiple instructions. Some use a 10 to 1 ratio in figuring performance while some specific
studies have shown a ratio of 6.25 more appropriate to use. Various architectures will have different
ratios.

EP 0 570 729 A2
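
The effect of the two ratios quoted in the FLOPS entry can be shown with simple arithmetic. The
sketch below is illustrative only: it assumes the 5 MIPS per processor and eight processors per chip
given in the abstract, and the function name estimated_mflops is hypothetical. The results are not
performance figures stated in this description.

```python
def estimated_mflops(mips: float, instructions_per_flop: float) -> float:
    """Estimate floating-point throughput when each floating-point operation
    is built from several fixed-point instructions."""
    if instructions_per_flop <= 0:
        raise ValueError("instructions_per_flop must be positive")
    return mips / instructions_per_flop


MIPS_PER_PROCESSOR = 5.0   # from the abstract
PROCESSORS_PER_CHIP = 8    # from the abstract

for ratio in (10.0, 6.25):  # the two ratios mentioned in the FLOPS entry
    per_processor = estimated_mflops(MIPS_PER_PROCESSOR, ratio)
    per_chip = per_processor * PROCESSORS_PER_CHIP
    print(f"ratio {ratio}: {per_processor:.2f} MFLOPS per processor, {per_chip:.2f} per chip")
```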

o Functional unit

A functional unit is an entity of hardware, software, or both, capable of accomplishing a purpose.

o Gbytes

Gbytes refers to a billion bytes. Gbytes/s would be a billion bytes per second.

o GIGAFLOPS

(10)**9 floating-point instructions per second.

o GOPS and PETAOPS

GOPS or BOPS have the same meaning: billions of operations per second. PETAOPS means
trillions of operations per second, a potential of the current machine. For our APAP machine they are
just about the same as BIPs/GIPs, meaning billions of instructions per second. In some machines an
instruction can cause two or more operations (i.e., both an add and multiply) but we don't do that.
Alternatively it could take many instructions to do an op. For example we use multiple instructions to
perform 64 bit arithmetic. In counting ops, however, we did not elect to count log ops. GOPS may be
the preferred use to describe performance, but there is no consistency in usage that has been noted.
One sees MIPs/MOPs then BIPs/BOPs and MegaFLOPS/GigaFLOPS/TeraFLOPS/PetaFLOPS.

o ISA

ISA means the Instruction Set Architecture.

o Link

A link is an element which may be physical or logical. A physical link is the physical connection for
joining elements or units, while in computer programming a link is an instruction or address that
passes control and parameters between separate portions of the program. In multisystems a link is
the connection between two systems which may be specified by program code identifying the link
which may be identified by a real or virtual address. Thus generally a link includes the physical
medium, any protocol, and associated devices and programming; it is both logical and physical.

o MFLOPS

MFLOPS means (10)**6 floating-point instructions per second.

o MIMD

MIMD is used to refer to a processor array architecture wherein each processor in the array has its
own instruction stream, thus Multiple Instruction stream, to execute Multiple Data streams located one
per processing element.

o Module

A module is a program unit that is discrete and identifiable or a functional unit of hardware designed
for use with other components. Also, a collection of PEs contained in a single electronic chip is called
a module.

o Node

Generally, a node is the junction of links. In a generic array of PEs, one PE can be a node. A node
can also contain a collection of PEs called a module. In accordance with our invention a node is
formed of an array of PMEs, and we refer to the set of PMEs as a node. Preferably a node is 8 PMEs.

o Node array

A collection of modules made up of PMEs is sometimes referred to as a node array; it is an array of
nodes made up of modules. A node array is usually more than a few PMEs, but the term
encompasses a plurality.

o PDE

A PDE is a partial differential equation.

o PDE relaxation solution process

The PDE relaxation solution process is a way to solve a PDE (partial differential equation). Solving PDEs
uses most of the super computing compute power in the known universe and can therefore be a good
example of the relaxation process. There are many ways to solve the PDE equation and more than
one of the numerical methods includes the relaxation process. For example, if a PDE is solved by
finite element methods, relaxation consumes the bulk of the computing time. Consider an example
from the world of heat transfer. Given hot gas inside a chimney and a cold wind outside, how will the
temperature gradient within the chimney bricks develop? By considering the bricks as tiny segments
and writing an equation that says how heat flows between segments as a function of temperature
differences, the heat transfer PDE has been converted into a finite element problem. If we then
say all elements except those on the inside and outside are at room temperature while the boundary
segments are at the hot gas and cold wind temperature, we have set up the problem to begin
relaxation. The computer program then models time by updating the temperature variable in each
segment based upon the amount of heat that flows into or out of the segment. It takes many cycles of
processing all the segments in the model before the set of temperature variables across the chimney
relaxes to represent the actual temperature distribution that would occur in the physical chimney. If the
objective was to model gas cooling in the chimney then the elements would have to extend to gas
equations, and the boundary conditions on the inside would be linked to another finite element model,
and the process continues. Note that the heat flow is dependent upon the temperature difference
between the segment and its neighbors. It thus uses the inter-PE communication paths to distribute
the temperature variables. It is this near neighbor communication pattern or characteristic that makes
PDE relaxation very applicable to parallel computing.
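
The chimney example can be sketched as a one-dimensional relaxation in Python. This is a sequential
illustration under stated assumptions, not the APAP implementation: the wall is reduced to a single
row of segments, the temperatures, segment count and the coefficient alpha are invented for the
example, and each update reads only the two neighboring segments, which is the near neighbor pattern
described above.

```python
from typing import List


def relax_wall(hot: float, cold: float, room: float, segments: int,
               alpha: float = 0.25, tolerance: float = 1e-9,
               max_cycles: int = 1_000_000) -> List[float]:
    """Relax a row of brick segments between a hot inner and a cold outer boundary.

    alpha scales how much heat moves per cycle; values below 0.5 keep the update stable.
    """
    # Boundary segments are fixed; interior segments start at room temperature.
    temps = [hot] + [room] * segments + [cold]
    for _ in range(max_cycles):
        updated = temps[:]
        for i in range(1, len(temps) - 1):
            # Heat flow depends only on the differences to the two neighbors.
            heat_in = (temps[i - 1] - temps[i]) + (temps[i + 1] - temps[i])
            updated[i] = temps[i] + alpha * heat_in
        largest_change = max(abs(a - b) for a, b in zip(updated, temps))
        temps = updated
        if largest_change < tolerance:   # the temperature variables have relaxed
            break
    return temps


profile = relax_wall(hot=400.0, cold=0.0, room=20.0, segments=8)
print([round(t, 2) for t in profile])
```

In a parallel array each segment, or block of segments, would be held by one processing element, and
the two neighbor reads per cycle would travel over the inter-PE communication paths.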

o PICKET

This is the element in an array of elements making up an array processor. It consists of: data flow
(ALU REGS), memory, control, and the portion of the communication matrix associated with the
element. The unit refers to a 1/nth of an array processor made up of parallel processor and memory
elements with their control and portion of the array intercommunication mechanism. A picket is a form
of processor memory element or PME. Our PME chip design processor logic can implement the
picket logic described in related applications or have the logic for the array of processors formed as a
node. The term PICKET is similar to the commonly used array term PE for processing element, and
is an element of the processing array preferably comprised of a combined processing element and
local memory for processing bit parallel bytes of information in a clock cycle. The preferred
embodiment consists of a byte wide data flow processor, 32k bytes or more of memory, primitive
controls and ties to communications with other pickets.

The term "picket" comes from Tom Sawyer and his white fence, although it will also be
understood functionally that a military picket line analogy fits quite well.

o Picket Chip

A picket chip contains a plurality of pickets on a single silicon chip.

o Picket Processor system (or Subsystem)

A picket processor is a total system consisting of an array of pickets, a communication network, an
I/O system, and a SIMD controller consisting of a microprocessor, a canned routine processor, and a
microcontroller that runs the array.

o Picket Architecture

The Picket Architecture is the preferred embodiment for the SIMD architecture with features that
accommodate several diverse kinds of problems including:

- set associative processing

- parallel numerically intensive processing

- physical array processing similar to images

o Picket Array

A picket array is a collection of pickets arranged in a geometric order, a regular array.

o PME or processor memory element

PME is used for a processor memory element. We use the term PME to refer to a single processor,
memory and I/O capable system element or unit that forms one of our parallel array processors. A
processor memory element is a term which encompasses a picket. A processor memory element is
1/nth of a processor array which comprises a processor, its associated memory, control interface, and
a portion of an array communication network mechanism. This element can have a processor memory
element with a connectivity of a regular array, as in a picket processor, or as part of a subarray, as in
the multi-processor memory element node we have described.

o Routing

Routing is the assignment of a physical path by which a message will reach its destination. Routing
assignments have a source or origin and a destination. These elements or addresses have a
temporary relationship or affinity. Often, message routing is based upon a key which is obtained by
reference to a table of assignments. In a network, a destination is any station or network addressable
unit addressed as the destination of information transmitted by a path control address that identifies
the link. The destination field identifies the destination with a message header destination code.

o SIMD

A processor array architecture wherein all processors in the array are commanded from a Single
Instruction stream to execute Multiple Data streams located one per processing element.

o SIMDMIMD or SIMD/MIMD

SIMDMIMD or SIMD/MIMD is a term referring to a machine that has a dual function that can switch
from MIMD to SIMD for a period of time to handle some complex instruction, and thus has two
modes. The Thinking Machines, Inc. Connection Machine model CM-2 when placed as a front end or
back end of a MIMD machine permitted programmers to operate different modes for execution of
different parts of a problem, referred to sometimes as dual modes. These machines have existed since
ILLIAC and have employed a bus that interconnects the master CPU with other processors. The master
control processor would have the capability of interrupting the processing of other CPUs. The other
CPUs could run independent program code. During an interruption, some provision must be made for
checkpointing (closing and saving current status of the controlled processors).

o SIMIMD

SIMIMD is a processor array architecture wherein all processors in the array are commanded from a
Single Instruction stream, to execute Multiple Data streams located one per processing element.
Within this construct, data dependent operations within each picket that mimic instruction execution
are controlled by the SIMD instruction stream.

This is a Single Instruction Stream machine with the ability to sequence Multiple Instruction
streams (one per Picket) using the SIMD Instruction stream and operate on Multiple Data Streams
(one per Picket). SIMIMD can be executed by a processor memory element system.
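
One way to picture this mode is a single broadcast instruction that tells every picket to apply the
operation named in its own local memory. The Python sketch below is one possible reading of the
definition, offered as an assumption rather than the described hardware; the Picket class, the
OPERATIONS table and broadcast_apply_local are hypothetical names.

```python
from typing import Callable, Dict, List

# Operations a picket can select locally; the selection itself is data in picket memory.
OPERATIONS: Dict[str, Callable[[int, int], int]] = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}


class Picket:
    def __init__(self, local_opcode: str, value: int) -> None:
        self.local_opcode = local_opcode  # data-dependent choice held in local memory
        self.value = value                # this picket's own data stream


def broadcast_apply_local(pickets: List[Picket], operand: int) -> None:
    """A single SIMD instruction: every picket applies its locally stored operation."""
    for picket in pickets:
        picket.value = OPERATIONS[picket.local_opcode](picket.value, operand)


pickets = [Picket("add", 10), Picket("sub", 10), Picket("mul", 10)]
broadcast_apply_local(pickets, 3)
simimd_values = [p.value for p in pickets]
print(simimd_values)  # [13, 7, 30]
```

The controller issues one instruction stream throughout, yet the pickets behave as if each were
following its own.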

o SISD

SISD is an acronym for Single Instruction Single Data.

o Swapping

Swapping interchanges the data content of a storage area with that of another area of storage.

o Synchronous Operation

Synchronous operation in a MIMD machine is a mode of operation in which each action is related to
an event (usually a clock); it can be a specified event that occurs regularly in a program sequence. An
operation is dispatched to a number of PEs, which then go off to independently perform the function.
Control is not returned to the controller until the operation is completed. If the request is to an array of
functional units, the request is generated by a controller to elements in the array which must complete
their operation before control is returned to the controller.

o TERAFLOPS

TERAFLOPS means (10)**12 floating-point instructions per second.

o VLSI

VLSI is an acronym for very large scale integration (as applied to integrated circuits).

o Zipper

A zipper is a new function provided. It allows for links to be made from devices which are external to
the normal interconnection of an array configuration.

BACKGROUND OF THE INVENTION 

In the never ending quest for faster computers, engineers are linking hundreds, and even thousands of
low cost microprocessors together in parallel to create super supercomputers that divide in order to conquer
complex problems that stump today's machines. Such machines are called massively parallel. We have
created a new way to create massively parallel systems. The many improvements which we have made
should be considered against the background of many works of others.

Multiple computers operating in parallel have existed for decades. Early parallel machines included the
ILLIAC, which was started in the 1960s. ILLIAC IV was built in the 1970s. Other multiple processors include
(see a partial summary in U.S. Patent 4,975,834 issued December 4, 1990 to Xu et al.) the Cedar, Sigma-1,
the Butterfly and the Monarch, the Intel iPSC, the Connection Machines, the Caltech COSMIC, the N Cube,
IBM's RP3, IBM's GF11, the NYU Ultra Computer, the Intel Delta and Touchstone.

Large multiple processors beginning with ILLIAC have been considered supercomputers. Supercomputers
with greatest commercial success have been based upon multiple vector processors, represented by
the Cray Research Y-MP systems, the IBM 3090, and other manufacturers' machines including those of
Amdahl, Hitachi, Fujitsu, and NEC.

Massively Parallel Processors (MPPs) are now thought of as capable of becoming supercomputers.
These computer systems aggregate a large number of microprocessors with an interconnection network
and program them to operate in parallel. There have been two modes of operation of these computers.
Some of these machines have been MIMD mode machines.

Some of these machines have been SIMD mode machines. Perhaps the most commercially acclaimed
of these machines has been the Connection Machines series 1 and 2 of Thinking Machines, Inc. These
have been essentially SIMD machines. Many of the massively parallel machines have used microprocessors
interconnected in parallel to obtain their concurrency or parallel operations capability. Intel microprocessors
like i860 have been used by Intel and others. N Cube has made such machines with Intel '386
microprocessors. Other machines have been built with what is called the "transputer" chip. Inmos
Transputer IMS T800 is an example. The Inmos Transputer T800 is a 32 bit device with an integral high
speed floating point processor.

As an example of the kind of systems that are built, several Inmos Transputer T800 chips each would
have 32 communication link inputs and 32 link outputs. Each chip would have a single processor, a small
amount of memory, and communication links to the local memory and to an external interface. In addition,
in order to build up the system, communication link adaptors like IMS C011 and C012 would be connected.
In addition switches, like an IMS C004, would provide, say, a crossbar switch between the 32 link inputs and
32 link outputs to provide point-to-point connection between additional transputer chips. In addition, there
will be special circuitry and interface chips for transputers adapting them to be used for a special purpose
tailored to the requirements of a specific device, a graphics or disk controller. The Inmos IMS M212 is a 16
bit processor, with on chip memory and communication links. It contains hardware and logic to control disk
drives and can be used as a programmable disk controller or as a general purpose interface. In order to use
the concurrency (parallel operations) Inmos developed a special language, Occam, for the transputer.
Programmers have to describe the network of transputers directly in an Occam program.

Some of these massively parallel machines use parallel processor arrays of processor chips which are
interconnected with different topologies. The transputer provides a crossbar network with the addition of IMS
C004 chips. Some other systems use a hypercube connection. Others use a bus or mesh to connect the
microprocessors and their associated circuitry. Some have been interconnected by circuit switch processors
that use switches as processor addressable networks. Generally, as with the 14 RISC/6000s which
were interconnected last fall at Lawrence Livermore by wiring the machines together, the processor
addressable networks have been considered as coarse-grained multiprocessors.

Some very large machines are being built by Intel and nCube and others to attack what are called
"grand challenges" in data processing. However, these computers are very expensive. Recent projected
costs are in the order of $30,000,000.00 to $75,000,000.00 (Tera Computer) for computers whose
development has been funded by the U.S. Government to attack the "grand challenges". These "grand
challenges" would include such problems as climate modeling, fluid turbulence, pollution dispersion,
mapping of the human genome and ocean circulation, quantum chromodynamics, semiconductor and
supercomputer modeling, combustion systems, vision and cognition.

As a footnote to our background, we should recognize one of the early massively parallel machines
developed by IBM. In our description we have chosen to use the term processor memory element rather
than "transputer" to describe one of the eight or more memory units with processor and I/O capabilities
which make up the array of PMEs in a chip, or node. The referenced prior art "transputer" has on a chip
one processor, a Fortran coprocessor, and a small memory, with an I/O interface. Our processor memory
element could apply to a transputer and to the PME of the RP3 generally. However, as will be recognized,
our little chip is significantly different in many respects. Our little chip has many features described later.
However, we do recognize that the term PME was first coined for another, now more typical, PME which
formed the basis for the massively parallel machine known as the RP3. The IBM Research Parallel
Processing Prototype (RP3) was an experimental parallel processor based on a Multiple Instruction Multiple
Data (MIMD) architecture. RP3 was designed and built at IBM T.J. Watson Research Center in cooperation
with the New York University Ultracomputer project. This work was sponsored in part by Defense Advanced
Research Project Agency. RP3 was comprised of 64 Processor-Memory Elements (PMEs) interconnected
by a high speed omega network. Each PME contained a 32-bit IBM "PC scientific" microprocessor, 32-kB
cache, a 4-MB segment of the system memory, and an I/O port. The PME I/O port hardware and software
supported initialization, status acquisition, as well as memory and processor communication through shared
I/O Support Processors (ISPs). Each ISP supports eight processor-memory elements through the Extended
I/O adapters (ETIOs), independent of the system network. Each ISP interfaced to the IBM S/370 channel
and the IBM Token-Ring network as well as providing operator monitor service. Each extended I/O adapter
attached as a device to a PME ROMP Storage Channel (RSC) and provided programmable PME
control/status signal I/O via the ETIO channel. The ETIO channel is the 32-bit bus which interconnected the
ISP to the eight adapters. The ETIO channel relied on a custom interface protocol which was supported by
hardware on the ETIO adapter and software on the ISP.

Problems addressed by our APAP machine

The machine which we have called the Advanced Parallel Array Processor (APAP) is a fine-grained
parallel processor which we believe is needed to address issues of prior designs. As illustrated above, there
have been many fine-grained (and also coarse-grained) processors constructed from both point design and
off-the-shelf processors using dedicated and shared memory and any one of the many possible interconnection
schemes. To date these approaches have all encountered one or more design and performance
limitations. Each "solution" leads in a different direction. Each has its problems. Existing parallel machines
are difficult to program. Each is not generally adaptable to various sizes of machines compatible across a
range of applications. Each has its design limitations caused by physical design, interconnection and
architectural issues.

Physical issues 

Some approaches utilize a separate chip design for each of the various functions required in a
horizontal structure. These approaches suffer performance limitations due to chip crossing delays.

Other approaches integrate various functions together vertically onto a single chip. These approaches
suffer performance limitations due to the physical limit on the number of logic gates which can be
integrated onto a producible chip.

Interconnection Issues 



Networks which interconnect the various processing functions are important to fine-grained parallel
processors. Processor designs with buses, meshes, and hypercubes have all been developed. Each of
these networks has inherent limitations as to processing capability. Buses limit both the number of
processors which can be physically interconnected and the network performance. Meshes lead to large
network diameters which limit network performance. Hypercubes require each node to have a large number
of interconnection ports; the number of processors which can be interconnected is limited by the physical
input/output pins at the node. Hypercubes are recognized as having some significant performance gains
over the prior bus and mesh structures.
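
The trade-off between mesh diameter and hypercube port count can be made concrete with a short
Python sketch. It uses the standard formulas for a binary hypercube and a square two-dimensional
mesh without wraparound; the function names and the 4096-node example are assumptions chosen for
illustration and are not figures from this description.

```python
import math


def hypercube_dimension(nodes: int) -> int:
    """Ports per node (and network diameter) of a binary hypercube with `nodes` nodes."""
    if nodes < 1 or nodes & (nodes - 1):
        raise ValueError("node count must be a power of two")
    return nodes.bit_length() - 1


def mesh_2d_diameter(nodes: int) -> int:
    """Longest shortest path, in hops, across a square 2D mesh without wraparound."""
    side = math.isqrt(nodes)
    if side * side != nodes:
        raise ValueError("node count must be a perfect square")
    return 2 * (side - 1)


nodes = 4096
print("hypercube ports per node:", hypercube_dimension(nodes))  # 12
print("hypercube diameter:", hypercube_dimension(nodes))        # 12
print("2D mesh ports per node: 4")
print("2D mesh diameter:", mesh_2d_diameter(nodes))             # 126
```

The mesh keeps the port count fixed while its diameter grows with the square root of the node count,
whereas the hypercube keeps the diameter small at the cost of a port count that grows with the
logarithm of the node count.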

Architectural Issues

Processes which are suitable for fine-grained parallel processors fall into two distinct types. Processes
which are functionally partitionable tend to perform better on multiple instruction, multiple data (MIMD)
architectures. Processes which are not functionally partitionable but have multiple data streams tend to
perform better on single instruction, multiple data (SIMD) architectures. For any given application, there is
likely to be some number of both types of processes. System trade-offs are required to pick the
architecture which best suits a particular application but no single solution has been satisfactory.

SUMMARY OF THE INVENTION 

We have created a new way to make massively parallel processors and other computer systems by
creating a new "chip" and systems designed with our new concept. This application is directed to such
systems. Components described in our applications can be combined in our systems to make new
systems. They also can be combined with existing technology.

Thus, our little CMOS DRAM chip of approximately 14 x 14 mm can be put together much like bricks
are walled in a building or paved to form a brick road. Our chip provides the structure necessary to build a
"house", a complex computer system, by connected replication.

Placing our development in perspective, four little chips, each one alike, each one with eight or more
processors embedded in memory with an internal array capability and external I/O broadcast and control
interface, would provide the memory and processing power of thirty-two or more complex computers, and
they could all be placed with compact hybrid packaging into something the size of a watch, and operated
with very low power, as each chip only dissipates about 2 watts. With this chip, we have created many new
concepts, and those that we consider our invention are described in detail in the description and claims.
The systems that can be created with our computer system can range from small devices to massive
machines with PETAOP potential.

Our little memory chip array processor we call our Advanced Parallel Array Processor. Though small, it
is complex and powerful. A typical cluster will have many chips.

Many aspects and features of invention have been described in this and related applications. These
concepts and features of invention improve and are applicable to computer systems which may not employ
each invention. We believe our concepts and features will be adopted and used in the next century.

This technical description provides an overview of our Advanced Parallel Array Processor (APAP)
representing our new memory concepts and our effort in developing a scalable massively parallel processor
(MPP) that is simple (very small number of unique part numbers) and has very high performance. Our
processor utilizes in its preferred embodiment a VLSI chip. The chip comprises 2n PME microcomputers.
"n" represents the maximum number of array dimensionality. The chip further comprises a broadcast and
control interface (BCI) and internal and external communication paths between PMEs on the chip among
themselves and to the off chip system environment. The preferred chip has 8 PMEs (but we also can
provide more) and one BCI. The 2n PMEs and BCI are considered a node. This node can function in either
SIMD or MIMD mode, in dual SIMD/MIMD mode, with asynchronous processing, and with SIMIMD functionality.
Since it is scalable, this approach provides a node which can be the main building block for scalable
parallel processors of varying size. The microcomputer architecture of the PME provides fully distributed
message passing interconnection and control features within each node, or chip. Each node provides
multiple parallel microcomputer capability at the chip level, the microprocessor or personal computer level, at
a workstation level, at special application levels which may be represented by a vision and/or avionics level,
and, when fully extended, to capability at greater levels with powerful Gigaflop performance into the
supercomputer range. The simplicity is achieved by the use of a single highly extended DRAM chip that is
replicated into parallel clusters. This keeps the part number count down and allows scaling capability to the
cost or performance need, by varying the chip count, then the number of modules, etc.

Our approach enables us to provide a machine with attributes meeting the requirements that drive to a
parallel solution in a series of applications. Our methods of parallelization at the sub-chip level serve to keep
weight, volume, and recurring and logistic costs down.

Because our different size systems are all based upon a single chip, software tools are common for all
size systems. This offers the potential of development software (running on smaller workstation machines)
that is interchangeable among all levels (workstation, aerospace, and supercomputer). That advantage
means programmers can develop programs on workstations while a production program runs on a much
larger machine.

As a result of our well balanced design implementation we meet today's requirement imposed by
technology, performance, cost and perception, and enable growth of the system into the future. Since our
MPP approach starts at the chip level, our discussion starts at the chip technology description and
concludes with the supercomputer application descriptions.

Physical, interconnection, and architectural issues will all be addressed in the machine directly.
Functions will not only be integrated into a single chip design, but the chip design will provide functions
sufficiently powerful and flexible that the chip will be effective at processing, routing, storage and three
classes of I/O. The interconnection network will be a new version of the hypercube which provides minimum
network diameters without the input/output pin and wireability limitations normally associated with
hypercubes. The trade-offs between SIMD and MIMD are eliminated because the design allows processors to
dynamically switch between MIMD and SIMD mode. This eliminates many problems which will be
encountered by application programmers of "hybrid" machines. In addition, the design will allow a subset of
the processors to be in SIMD or MIMD mode.

The Advanced Parallel Array Processor (APAP) is a fine-grained parallel processor. It consists of control
and processing sections which are partitionable such that configurations suitable for supercomputing
through personal computing applications can be satisfied. In most configurations it would attach to a host
processor and support the off loading of segments of the host's workload. Because the APAP array
processing elements are general purpose computers, the particular type of workload off-loaded will vary
depending upon the capabilities of the host. For example, our APAP can be a module for an IBM 3090
vector processor mainframe. When attached to a mainframe with high performance vector floating point
capability the task off-loaded might be sparse to dense matrix transformations. Alternatively, when attached
to a PC personal computer the off-loaded task might be numerically intensive 3 dimensional graphics
processing.

The above referenced parent USSN 07/611,594, filed November 13, 1990 of Dieffenderfer et al, titled
"Parallel Associative Processor System", describes the idea of integrating computer memory and control
logic within a single chip and replicating the combination within the chip and building a processor system
out of replications of the single chip. This approach, which is continued and expanded here, leads to a
system which provides massively parallel processing capability at the cost of developing and manufacturing
only a single chip type while enhancing performance capability by reducing the chip boundary crossings
and line length.

The above referenced parent USSN 07/611,594, filed November 13, 1990 illustrated utilization of
1-dimensional I/O structures (essentially a linear I/O) with multiple SIMD PMEs attached to that structure within
a chip. This embodiment elaborates these concepts to dimensions greater than 1. The description which
follows will be in terms of 4-dimensional I/O structures with 8 SIMD/MIMD PMEs per chip. However, that
can be extended to greater dimensionality or more PMEs per dimension as we will describe with respect to
FIGURES 3, 9, 10, 15 and 16.

Our processing element includes a full I/O system including both data transfer and program interrupts.
Our description of our preferred embodiment will be primarily described in terms of the preferred
4-dimensional structures with 8 SIMD/MIMD PMEs per chip, which has special advantages now in our
view. However, that can be extended to greater dimensionality or more PMEs per dimension as described
in our parent application. In addition, for most applications we prefer and have made inventions in areas of
greater dimensions with hypercube interconnections, preferably with the modified hypercube we describe.
However, in some applications a 2-dimensional mesh interconnection of chips will be applicable to a task at
hand. For instance, in certain military computers a 2 dimensional mesh will be suitable and cost effective.

This disclosure extends the concepts from the interprocessor communication to the external
Input/Output facilities and describes the interfaces and modules required for control of the processing array.

In summary, three types of I/O are described: inter-processor, processors to/from external, and
broadcast/control. Massively parallel processing systems require all these types of I/O bandwidth demands to be
balanced with processor computing capability. Within the array these requirements will be satisfied by
replicating a 16 bit (reduced) instruction set processor, augmented with very fast interrupt state swapping
capability. That processor is referred to as the PME illustrating the preferred embodiment of our APAP. The
characteristics of the PME are completely unique when compared with the processing elements on other
massively parallel machines. It permits the processing, routing, storage and I/O to be completely distributed.
This is not characteristic of any other design.

In a hypercube each PME can address as its neighbor any PME whose address differs in any single bit
position. In a ring, any PME can address as its neighbor the two PMEs whose addresses differ by ±1. The
modified hypercube of our preferred embodiment utilized for the APAP combines these approaches by
building hypercubes out of rings. The intersection of rings is defined to be a node. Each node of our
preferred system has its PME, memory and I/O, and other features of the node, formed in a semiconductor
silicon low level CMOS DRAM chip. Nodes are constructed from multiple PMEs on each chip. Each PME
exists in only one ring of nodes. PMEs within the node are connected by additional rings such that
communications can be routed between rings within the node. This leads to the addressing structure where
any PME can step messages toward the objective by addressing a PME in its own ring or an adjacent PME
within the node. In essence a PME can address a PME whose address differs by 1 in the log2 d bit
field of its ring (where d is the number of PMEs in the ring) or the PME with the same address but existing
in an adjacent dimension. The PME effectively appears to exist in n sets of rings, while in actuality it exists
only in one real ring and one hidden ring totally contained within the chip. The dimensionality for the
modified hypercube is defined to be the value n from the previous sentence.
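
The three neighbor rules described above can be sketched in a few lines. This is an illustrative model under stated assumptions, not the addressing hardware of the preferred embodiment: the function names and the (coords, dim) address representation are hypothetical, and the model deliberately places a single PME per dimension in each node, ignoring the pairing of PMEs along a ring.

```python
from typing import List, Tuple

PmeAddress = Tuple[Tuple[int, ...], int]  # (ring position per dimension, own dimension)


def hypercube_neighbors(addr: int, bits: int) -> List[int]:
    """Binary hypercube: neighbors differ from addr in exactly one bit."""
    return [addr ^ (1 << b) for b in range(bits)]


def ring_neighbors(addr: int, d: int) -> List[int]:
    """Ring of d elements: neighbors differ from addr by +/-1 (modulo d)."""
    return [(addr - 1) % d, (addr + 1) % d]


def modified_hypercube_neighbors(coords: Tuple[int, ...], dim: int,
                                 d: int) -> List[PmeAddress]:
    """Simplified modified hypercube (assumption: one PME per dimension per node).

    A PME steps +/-1 along its own ring of nodes, or hands off to the PME
    with the same node coordinates that belongs to an adjacent dimension.
    """
    n = len(coords)
    neighbors: List[PmeAddress] = []
    for step in (-1, 1):  # along the PME's own ring of nodes
        moved = list(coords)
        moved[dim] = (coords[dim] + step) % d
        neighbors.append((tuple(moved), dim))
    for step in (-1, 1):  # within the node, to an adjacent dimension
        neighbors.append((coords, (dim + step) % n))
    return neighbors


if __name__ == "__main__":
    print(hypercube_neighbors(0b0101, 4))
    print(ring_neighbors(0, 16))
    print(modified_hypercube_neighbors((0, 0, 0, 0), 1, 8))
```

In this simplified model every PME has four neighbors regardless of machine size, which is the property that removes the port-count growth of a conventional hypercube.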

We prefer to use a modified hypercube. This is elaborated in the part of this application describing the
technology. Finally, PMEs within a ring are paired such that one moves data externally clockwise along a
ring of nodes and the other moves data externally counterclockwise along the ring of nodes, thus dedicating
a PME to an external port.

In our massively parallel machine, in our preferred embodiment the interconnection and broadcast of
data and instructions from one PME to another PME in the node and externally of the node to other nodes
of a cluster or PMEs of a massively parallel processing environment are performed by a programmable
router, allowing reconfiguration and virtual flexibility to the network operations. This important feature is fully
distributed and embedded in the PME and allows for processor communication and data transfers among
PMEs during operations of the system in SIMD and MIMD modes, as well as in the SIMD/MIMD and
SIMIMD modes of operation.

Within the rings each interconnection leg is a point-to-point connection. Each PME has a point-to-point
connection with the two neighboring PMEs in its ring and with two neighboring PMEs in two adjacent rings.
Three of these point-to-point connections are internal to the node, while the fourth point-to-point connection
is to an adjacent node.

The massively parallel processing system uses the processing elements, with their local memory and
interconnect topology, to connect all processors to each other. Embedded within the PME is our fully
distributed I/O programmable router. Our system also provides an addition to the system which provides the
ability to load and unload all the processing elements. With our zipper we provide a method for loading and
unloading of the array of PEs and thus enable implementation of a fast I/O along an edge of the array's
rings. To provide for external interface I/O any subset of the rings may be broken (un-zipped across some
dimension(s)) with the resultant broken paths connected to the external interface. The co-pending
application entitled "APAP I/O ZIPPER", filed concurrently herewith, USSN     , filed May 22, 1992,
describes our "zipper" in additional detail. The "zipper" can be applied to only the subset of links required to
support the peak external I/O load, which in all configurations considered so far leads to its being applied
only to one or two edges of the physical design.

The final type of I/O consists of data that must be broadcast to, or gathered from, all PMEs, plus data
which is too specialized to fit on the standard buses. Broadcast data includes commands, programs and
data. Gathered data is primarily status and monitor functions while diagnostic and test functions are the
specialized elements. Each node, in addition to the included set of PMEs, contains one Broadcast and
Control Interface (BCI) section.

Consider PMEs interconnected in a modified 4 dimensional hypercube network. If each ring contains 16
PMEs, then the system will have 32,768 PMEs. The network diameter is 18 steps. Each PME contains in
S/W the router and reconfiguration S/W to support a particular outgoing port. Thus, software routing
provides the capability to reconfigure in the event of a faulty processing element or node. Inherent in a 4d,
25 MHz network design with byte wide half duplex rings is the provision for 410 gigabytes per second peak
internal bandwidth.

The 4 dimensional hypercube leads to a particularly advantageous package. Eight of the PMEs
(including data flow, memory and I/O paths and controls) are encompassed in a single chip. Thus, a node
will be a single chip including pairs of elements along the rings. The nodes are configured together in an 8
X 8 array to make up a cluster. The fully populated machine is built up of an array of 8 X 8 clusters to
provide the maximum capacity of PMEs.
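
The packaging arithmetic above can be checked with a short sketch. The constant and function names below are hypothetical; the 5 MIPS per PME figure is the one quoted elsewhere in this description, and treating peak throughput as a simple product of PME count and per-PME rate is an assumption made for illustration.

```python
PMES_PER_NODE = 8            # eight PMEs per chip (node)
NODES_PER_CLUSTER = 8 * 8    # nodes arranged in an 8 x 8 array
CLUSTERS_PER_ARRAY = 8 * 8   # clusters arranged in an 8 x 8 array
MIPS_PER_PME = 5             # per-PME figure quoted in the description


def total_pmes(clusters: int = CLUSTERS_PER_ARRAY,
               nodes_per_cluster: int = NODES_PER_CLUSTER,
               pmes_per_node: int = PMES_PER_NODE) -> int:
    """PME count of a machine built by replicating the single chip."""
    return clusters * nodes_per_cluster * pmes_per_node


def peak_mips(pme_count: int, mips_per_pme: int = MIPS_PER_PME) -> int:
    """Aggregate fixed point rate, assuming perfect scaling (an idealization)."""
    return pme_count * mips_per_pme


if __name__ == "__main__":
    full = total_pmes()
    print("fully populated machine:", full, "PMEs,", peak_mips(full), "MIPS")
    print("single chip:", PMES_PER_NODE, "PMEs,", peak_mips(PMES_PER_NODE), "MIPS")
```

The fully populated count agrees with the 32,768 PME figure given for the modified 4 dimensional hypercube, and the single chip result agrees with eight processors of 5 MIPS each.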

Each PME is a powerful microcomputer having significant memory and I/O functions. There is multibyte
data flow within a reduced instruction set (RISC) architecture. Each PME has 16 bit internal data flow and
eight levels of program interrupts with the use of working and general registers to manage data flow. There
is a circuit switched and store and forward mode for I/O transfer under PME software control. The SIMD
mode or MIMD mode is under PME software control. The PME can execute RISC instructions from either
the BCI in a SIMD mode, or from its own main memory in MIMD mode. Specific RISC instruction code
points can be reinterpreted to perform unique functions in the SIMD mode. Each PME can implement an
extended Instruction Set Architecture and provide routines which perform macro level instructions such as
extended precision fixed point arithmetic, floating point arithmetic, vector arithmetic, and the like. This
permits not only complex math to be handled but image processing activities for display of image data in
multiple dimensions (2d and 3d images) and for multimedia applications. The system can select groups of
PMEs for a function. PMEs assigned can allocate selected data and instructions for group processing. The
operations can be externally monitored via the BCI. Each BCI has a primary control input, a secondary
control input and a status monitor output for the node. Within a node the 2n PMEs can be connected in a
binary hypercube communication network within the chip. Communication between PMEs is controlled by
the bits in PME control registers under control of PME software. This permits the system to have a virtual
routing capability. Each PME can step messages up or down its own ring or to its neighboring PME in
either of two adjacent rings. Each interface between PMEs is a point-to-point connection. The I/O ports
permit off-chip extensions of the internal ring to adjacent nodes of the system. The system is built up of
replications of a node to form a node array, a cluster, and other configurations.
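
The choice of instruction source in the two modes can be illustrated with a toy model. This is a sketch under stated assumptions rather than the PME instruction fetch logic: the class, field and function names are hypothetical, instructions are represented as plain strings, and the broadcast instruction stands in for whatever the BCI supplies.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Mode(Enum):
    SIMD = "SIMD"
    MIMD = "MIMD"


@dataclass
class PME:
    local_program: List[str]   # instructions held in the PME's own memory
    mode: Mode = Mode.MIMD     # selected under software control
    pc: int = 0                # local program counter, used in MIMD mode only

    def fetch(self, broadcast_instruction: str) -> str:
        """SIMD: take the broadcast instruction. MIMD: take the next local one."""
        if self.mode is Mode.SIMD:
            return broadcast_instruction
        instruction = self.local_program[self.pc % len(self.local_program)]
        self.pc += 1
        return instruction


def step(pmes: List[PME], broadcast_instruction: str) -> List[str]:
    """One cycle: each PME fetches from the source its current mode selects."""
    return [pme.fetch(broadcast_instruction) for pme in pmes]


if __name__ == "__main__":
    pmes = [PME(["ADD"], Mode.SIMD), PME(["LOAD", "STORE"], Mode.MIMD)]
    print(step(pmes, "SHIFT"))   # a subset in SIMD, the rest in MIMD
    pmes[1].mode = Mode.SIMD     # dynamic switch of one PME
    print(step(pmes, "BRANCH"))
```

Because the mode is a per-PME field, a subset of the elements can follow the broadcast stream while the others run their own programs.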

To complement our system's SIMD, MIMD, SIMD/MIMD and SIMIMD functionality, we
have provided unique operational modes. Among our SIMD/MIMD PME's unique modes are the new
functional features referred to as the "store and forward / circuit switch" functions. These hardware functions,
complemented with the on chip communication and programmable internal and external I/O routing, provide
the PME with very optimal data transferring capability. In the preferred mode of operation the processor
memory is generally the data sink for messages and data targeted at the PME in the store and forward
mode. Messages and data not targeted for the PME are sent directly to the required output port when in
circuit switched mode. The PME software performs the selected routing path while giving the PME a
dynamically selectable store and forward / circuit switch functionality.
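
The routing decision described in this paragraph can be sketched as follows. This is a minimal illustration and not the PME hardware or its software router: the function name, the message representation and the port selection callback are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Message = Tuple[int, bytes]  # (destination PME id, payload); an assumed format


def route(pme_id: int,
          message: Message,
          local_memory: List[bytes],
          select_port: Callable[[int], int]) -> Optional[int]:
    """Decide what one PME does with an arriving message.

    Targeted at this PME: store and forward mode, the local memory is the sink.
    Targeted elsewhere: circuit switched mode, pass straight to an output port.
    Returns the chosen output port, or None when the message was stored.
    """
    destination, payload = message
    if destination == pme_id:
        local_memory.append(payload)
        return None
    return select_port(destination)


if __name__ == "__main__":
    memory: List[bytes] = []
    # Hypothetical port choice: a software routing table would go here.
    pick = lambda destination: destination % 4
    print(route(3, (3, b"for me"), memory, pick), memory)
    print(route(3, (5, b"passing through"), memory, pick), memory)
```

Keeping the port choice in a callback mirrors the statement that the PME software selects the routing path, so the path can be changed without altering the store or pass-through decision.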

Among the advances we have provided is a fully distributed architecture for PMEs of a node. Each
node has 2n processors, memory and I/O. Every PME will provide very flexible processing capability with
16 bit data flow, 64K bytes of local storage, store and forward/circuit switch logic, PME to PME
communication, SIMD/MIMD switching capabilities, programmable routing, and dedicated floating point
assist logic. The organization of every PME and its communication paths with other PMEs within the same
chip is to minimize chip crossing delays. PME functions can be independently operated by the PME and
integrated with functions in the node, a cluster, and larger arrays.

Our massively parallel system is made up of nodal building blocks of multiprocessor nodes, clusters of
nodes, and arrays of PMEs already packaged in clusters. For control of these packaged systems we
provide a system array director which with the hardware controllers performs the overall Processing
Memory Element (PME) Array Controller functions in the massively parallel processing environment. The
Director comprises three functional areas: the Application Interface, the Cluster Synchronizer, and
normally a Cluster Controller. The Array Director will have the overall control of the PME array, using the
broadcast bus and our zipper connection to steer data and commands to all of the PMEs. The Array
Director functions as a software system interacting with the hardware to perform the role as the shell of the
APAP operating system.

The interconnection for our PMEs, a massively parallel array computer SIMD/MIMD processing
memory element (PME) interconnection, provides the processor to processor connection in the massively
parallel processing environment. Each PME utilizes our fully distributed interprocessor communication
hardware from the on-chip PME to PME connection, to the off-chip I/O facilities which support the
chip-to-chip interconnection. Our modified topology limits our cluster to cluster wiring while supporting the
advantages of hypercube connections.

The concepts which we employ for a PME node are related to the VLSI packaging techniques used for
the Advanced Parallel Array Processor (APAP) computer system disclosed here, which packaging features
of our invention provide enhancements to the manufacturing ability of the APAP system. These techniques
are unique in the area of massively parallel processor machines and will enable the machine to be
packaged and configured in optimal subsets that can be built and tested.

The packaging techniques take advantage of the eight PMEs packaged in a single chip and arranged in
an N-dimensional modified hypercube configuration. This chip level package or node of the array is the
smallest building block in the APAP design. These nodes are then packaged in an 8 x 8 array where the
+-X and the +-Y make rings within the array or cluster and the +-W and +-Z are brought out to the
neighboring clusters. A grouping of clusters makes up an array.

The intended applications for APAP computers depend upon the particular configuration and host. Large
systems attached to mainframes with effective vectorized floating point processors might address special
vectorizable problems, such as weather prediction, wind tunnel simulation, turbulent fluid modeling and
finite element modeling. Where these problems involve sparse matrices, significant work must be done to
prepare the data for vectorized arithmetic and likewise to store results. That workload would be off loaded
to the APAP. In intermediate size systems, the APAP might be dedicated to performing the graphics
operations associated with visualization, or with some preprocessing operation on incoming data (e.g.,
performing optimum assignment problems in military sensor fusion applications). Small systems attached to
workstations or PCs might serve as programmer development stations or might emulate a vectorized
floating point processor attachment or a 3d graphics processor.



BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 shows a parallel processor processing element like those which would utilize old technology.

FIG. 2 shows a massively parallel processor building block in accordance with our invention,
representing our new chip design.

FIG. 3 illustrates on the right side the preferred chip physical cluster layout for our preferred
embodiment of a chip single node fine grained parallel processor. There each chip is a
scalable parallel processor chip providing 5 MIPS performance with CMOS DRAM memory
and logic permitting air cooled implementation of massive concurrent systems. On the left
side of FIGURE 3, there is illustrated the replaced technology.

FIG. 4 shows a computer processor functional block diagram in accordance with the invention.

FIG. 5 shows a typical Advanced Parallel Array Processor computer system configuration.

FIG. 6 shows a system overview of our fine-grained parallel processor technology in accordance
with our invention, illustrating system build up using replication of the PME element which
permits systems to be developed with 40 to 163,840 MIPS performance.

FIG. 7 illustrates the hardware for the processing element (PME) data flow and local memory in
accordance with our invention, while

FIG. 8 illustrates PME data flow where a processor memory element is configured as a hardwired
general purpose computer that provides about 5 MIPS fixed point processing or 0.4 Mflops via
programmed control floating point operations.

FIG. 9 shows the PME to PME connection (binary hypercube) and data paths that can be taken in
accordance with our invention, while

FIG. 10 illustrates node interconnections for the chip or node which has 8 PMEs, each of which
manages a single external port and permits distribution of the network control function and
eliminates a functional hardware port bottleneck.

FIG. 11 is a block diagram of a scalable parallel processor chip where each PME is a 16 bit wide
processor with 32K words of local memory and there is I/O porting for a broadcast port
(which provides a controller-to-all interface) while external ports are bidirectional point-to-point
interfaces permitting ring torus connections within the chip and externally.

FIG. 12 shows an array director in the preferred embodiment.

FIG. 13 in part (a) illustrates the system bus to or from a cluster array coupling enabling loading or
unloading of the array by connecting the edges of clusters to the system bus (see FIGURE
14). In FIGURE 13 in part (b) there is the bus to/from the processing element portion.
FIGURE 13 illustrates how multiple system buses can be supported with multiple clusters.
Each cluster can support 50 to 57 Mbyte/s bandwidth.

FIG. 14 shows a "zipper" connection for fast I/O connection.

FIG. 15 shows an 8 degree hypercube connection illustrating a packaging technique in accordance
with our invention applicable to an 8 degree hypercube.

FIG. 16 shows two independent node connections in the hypercube.

FIG. 17 shows the Bitonic Sort algorithm as an example to illustrate the advantages of the defined
SIMD/MIMD processor system.

FIG. 18 illustrates a system block diagram for a host attached large system with one application
processor interface illustrated. This illustration may also be viewed with the understanding
that our invention may be employed in stand alone systems which use multiple application
processor interfaces. Such interfaces in a FIGURE 18 configuration will support
DASD/graphics on all or many clusters. Workstation accelerators can eliminate the host,
application processor interface (API) and cluster synchronizer (CS) illustrated by emulation.
The CS is not required in all instances.

FIG. 19 illustrates the software development environment for our system. Programs can be prepared
by and executed from the host application processor. Both program and machine debug is
supported by the workstation based console illustrated here and in FIGURE 22. Both of these
services will support applications operating on a real or a simulated MMP, enabling
applications to be developed at a workstation level as well as on a supercomputer formed of the
APAP MMP. The common software environment enhances programmability and distributed
usage.

FIG. 20 illustrates the programming levels which are permitted by the new systems. As different
users require more or less detailed knowledge, the software system is developed to support
this variation. At the highest level the user does not need to know the architecture is indeed
an MMP. The system can be used with existing language systems for partitioning of
programs, such as parallel Fortran.

FIG. 21 illustrates the parallel Fortran compiler system for the MMP provided by the APAP
configurations described. A sequential to parallel compiler system uses a combination of existing
compiler capability with new data allocation functions and enables use of a partitioning
program like FortranD.

FIG. 22 illustrates the workstation application of the APAP, where the APAP becomes a workstation
accelerator. Note that the unit has the same physical size as a RISC/6000 Model 630, but
this model now contains an MMP which is attached to the workstation via a bus extension
module illustrated.

FIG. 23 illustrates an application for an APAP MMP module for an AWACS military or commercial
application. This is a way of handling efficiently the classical distributed sensor fusion
problem shown in FIGURE 23, where the observation to track matching is classically done
with well known algorithms like nearest neighbor, 2 dimensional linear assignment (Munkres...),
probabilistic data association or multiple hypothesis testing, but these can now be done in an
improved manner as illustrated by FIGURES 24 and 25.

FIG. 24 illustrates how the system provides the ability to handle n-dimensional assignment problems
in real time.

FIG. 25 illustrates processing flow for an n-dimensional assignment problem utilizing an APAP.

FIG. 26 illustrates the expansion unit provided by the system enclosure described, showing how a
unit can provide 424 Mflops or 5120 MIPS using only 8 to 10 extended SEM-E modules,
providing the performance comparable to that of a specialized signal processor module in only
.6 cubic feet. This system can become a SIMD massive machine with 1024 parallel
processors performing two billion operations per second (GOPS) and can grow by adding
1024 additional processors and 32MB additional storage.

FIG. 27 illustrates the APAP packaging for a supercomputer. Here is a large system of comparable
performance but much smaller footprint than other systems. It can be built by replicating the
APAP cluster within an enclosure like those used for smaller machines.

We have provided, as part of the description, Tables illustrating the hardwired instructions for a PME, in
which Table 1 illustrates fixed-point arithmetic instructions; Table 2 illustrates storage to storage
instructions; Table 3 illustrates logical instructions; Table 4 illustrates shift instructions; Table 5 illustrates
branch instructions; Table 6 illustrates the status switching instructions; and Table 7 illustrates the
input/output instructions.

(Note: For convenience of illustration in the formal patent drawings, FIGURES may be separated in parts
and as a convention we place the top of the FIGURE as the first sheet with subsequent sheets proceeding
down and across when viewing the FIGURE, in the event that multiple sheets are used.)

Our detailed description follows with parts explaining the preferred embodiments of our invention 
provided by way of example. 

DETAILED DESCRIPTION OF THE INVENTION 

Turning now to our invention in greater detail, it will be seen from FIGURE 1, which illustrates the
existing technology level, illustrated by the transputer T800 chip, and representing similar chips for such
machines as are illustrated by the Touchstone Delta (i860), N Cube ('386), and others. When FIGURE 1 is
compared with the developments here, it will be seen that not only can systems like the prior systems be
substantially improved by employing our invention, but also new powerful systems can be created, as we
will describe. FIGURE 1's conventional modern microprocessor technology consumes pins and memory.
Bandwidth is limited and inter-chip communication drags the system down.

The new technology leapfrog represented by FIGURE 2 merges processors, memory, and I/O into multiple
PMEs (eight or more 16 bit processors each of which has no memory access delays and uses all the pins
for networking) formed on a single low power CMOS DRAM chip. The system can make use of ideas of our
prior referenced disclosures as well as inventions separately described in the applications filed concurrently
herewith and applicable to the system we describe here. Thus, for this purpose they are incorporated herein
by reference. Our concepts of grouping, autonomy, transparency, zipper interaction, asynchronous SIMD,
SIMIMD or SIMD/MIMD, can all be employed with the new technology, even though to lesser advantage
they can be employed in the systems of the prior technology and in combination with our own prior multiple
picket processor.

Our picket system can employ the present processor. Our basic concept is that we have now provided
a replicable brick, a new basic building block for systems with our new memory processor, a memory unit
having embedded processors, router and I/O. This basic building block is scalable. The basic system which
we have implemented employs a 4 Meg. CMOS DRAM. It is expandable to be used in larger memory
configurations, with 16Mbit DRAMs, and 64Mbit chips by expansion. Each processor is a gate array. With
denser deposition, many more processors, at higher clock speeds, can be placed on the same chips, and
using gates and additional memory will expand the performance of each PME. Scaling a single part type
provides a system framework and architecture which can have a performance well into the PETAOP range.

FIGURE 2 illustrates the memory processor which we call the PME or processor memory element in
accordance with our preferred embodiment. The chip has eight or more processors. In the pictured
embodiment there are eight. The chip can be expanded (horizontally) to add more processors. The chip
can, as preferred, retain the logic and expand the DRAM memory with additional cells linearly (vertically).
Pictured are 16 sections of 32K by 9 bit DRAM memory surrounding a field of CMOS gate array gates
which implement 8 replications of a 16 bit wide data flow processor.

The chip is made using IBM low power sub-micron CMOS deposition on silicon technology. It uses selected
silicon with trench to provide significant storage on a small chip surface. Our organized interconnection
of memory and multiple processors is made with IBM's advanced art of making semiconductor chips.
However, it will be recognized that the little chip we describe has about 4 Meg. memory. It is designed so
that as 16 Meg. memory technology becomes stable, when improved yields and methods of accommodating
defects are certain, our little chip can migrate to larger memory sizes, each 9 bits wide, without changing
the logic. Advances in photo and X-ray lithography keep pushing minimum feature size to well below 0.5
microns. Our design envisions more progress. These advances will permit placement of very large amounts
of memory with processing on a single silicon chip.

Our device is a 4 MEG CMOS DRAM believed to be the first general memory chip with extensive room
for logic. 16 replications of a 32K by 9-bit DRAM macro make up the memory array. The chip allocates
significant surface area for 120K cells of application logic, with triple level metal wiring.
The processor logic cells are preferably gate array cells. The 35 ns or less DRAM access time matches the
processor cycle time. This CMOS implementation provides logic density for a very effective PE (picket) and
does so while dissipating 1.3 watts for the logic. The separate memory sections of the chip, each 32K by 9
bits (with expansion not changing logic), surround the field of CMOS gate array gates representing 120K
cells and having the logic described in other figures. Memory is barriered and, with a separated power
source, dissipates .9 watts. In combining significant amounts of logic on the same silicon
substrate with significant amounts of memory, problems involved with the electrical noise incompatibility of
logic and DRAM have been overcome. Logic tends to be very noisy while memory needs relative quiet to
sense the millivolt size signals that result from reading the cells of DRAM. We prefer to provide trenched
triple metal layer silicon deposition, with separate barriered portions of the memory chip devoted to memory
and to processor logic with voltage and ground isolation, and separate power distribution and barriers, to
achieve compatibility between logic and DRAM.

APAP System Overview of Preferred Embodiments

This description introduces the new technology in the following order:

1. Technology

2. Chip H/W description

3. Networking and system build up

4. Software

5. Applications

The initial sections of the detailed description describe how 4 Meg DRAM low power CMOS chips are made
to include 8 processors on and as part of the manufactured PME DRAM chips, each supporting:

1. 16 bit 5 MIP dataflows,

2. independent instruction stream and interrupt processing, and

3. 8 bit (plus parity and controls) wide external port and interconnection to 3 other on chip processors.

Our invention provides multiple functions which are integrated into a single chip design. The chip will
provide PME functions which are powerful and flexible, sufficiently so that a chip having scalability
will be effective at processing, routing, storage and three classes of I/O. This chip has integrated memory
and control logic within the single chip to make the PME, and this combination is replicated within the chip.
A processor system is built from replications of the single chip.

The approach partitions the low power CMOS DRAM. It will be formed as multiple word length (16) bit
by 32K sections, associating one section with a processor. (We use the term PME to refer to a single
processor, memory and I/O capable system unit.) This partitioning leads to each DRAM chip being an 8
way 'cube connected' MIMD parallel processor with 8 byte wide independent interconnection ports. (See
FIGURE 6 for an illustration of a replication of fine-grained parallel technology, illustrating replication and the
ring torus possibilities.)
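
The partitioning can be checked with simple arithmetic. The following is a minimal sketch, not part of
the disclosure: it assumes the 16 DRAM macros of 32K by 9 bits described for the chip are divided evenly
among the 8 PMEs, and it assumes the ninth bit of each macro word is a parity bit. The constant and
function names are ours and purely illustrative.

```python
# Illustrative arithmetic only. Constant and function names are ours; the
# split of each 9-bit macro word into 8 data bits + 1 parity bit is assumed.
MACROS_PER_CHIP = 16            # 32K x 9-bit DRAM macros on the chip
WORDS_PER_MACRO = 32 * 1024     # 32K addresses per macro
BITS_PER_MACRO_WORD = 9
DATA_BITS_PER_MACRO_WORD = 8    # assumption: the ninth bit is parity
PMES_PER_CHIP = 8


def chip_memory_budget():
    """Return the per-PME and per-chip storage implied by the figures above."""
    macros_per_pme = MACROS_PER_CHIP // PMES_PER_CHIP
    data_bits_per_word = macros_per_pme * DATA_BITS_PER_MACRO_WORD
    return {
        "macros_per_pme": macros_per_pme,
        "data_bits_per_word": data_bits_per_word,
        "data_bytes_per_pme": WORDS_PER_MACRO * data_bits_per_word // 8,
        "chip_data_bits": MACROS_PER_CHIP * WORDS_PER_MACRO * DATA_BITS_PER_MACRO_WORD,
        "chip_total_bits": MACROS_PER_CHIP * WORDS_PER_MACRO * BITS_PER_MACRO_WORD,
    }


if __name__ == "__main__":
    for key, value in chip_memory_budget().items():
        print(f"{key}: {value}")
```

Under these assumptions each PME receives two macros, which gives a 16 bit data word over 32K addresses
(64 KBytes per PME), and the data bits of the whole chip add up to 4 Mbit.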

The software description addresses several distinct program types. At the lowest level, processes
interface the user's program (or services called by the application) to the detailed hardware (H/W) needs.
This level includes the tasks required to manage the I/O and interprocessor synchronization and is what
might be called a microprogram for the MPP. An intermediate level of services provides for both mapping
applications (developed with vector or matrix operations) to the MPP, and also control, synchronization,
startup, and diagnostic functions. At the host level, high order languages are supported by library functions that
support vectorized programs with either simple automatic data allocation to the MPP or user tuned data
allocation. The multi-level software (S/W) approach permits applications to exploit different degrees of control
and optimization within a single program. Thus, a user can code application programs without understanding
the architecture detail, while an optimizer might tune at the microcode level only the small high usage
kernels of a program.

Sections of our description that describe 1024 element 5 GIPS units and a 32,768 element 164 GIPS
unit illustrate the range of possible systems. However, those are not the limits; both smaller and larger units
are feasible. These particular sizes have been selected as examples because the small unit is suitable to
microprocessors (accelerators), personal computers, workstation and military applications (using of course
different packaging techniques), while the larger unit is illustrative of a mainframe application as a module or
complete supercomputer system. A software description will provide examples of other challenging work
that might be effectively programmed on each of the illustrative systems.

PME DRAM CMOS - A BASE FOR A MULTIPROCESSOR PME

FIGURE 2 illustrates our technology improvement at the chip technology level. This extendable
computer organization is very cost and performance efficient over the wide range of system sizes because
it uses only one chip type. Combining the memory and processing on one chip eliminates the pins
dedicated to the memory bus and their associated reliability and performance penalties. Replication of our
design within the chip makes it economically feasible to consider custom logic designs for processor
subsections. Replication of the chip within the system leads to large scale manufacturing economies.
Finally, CMOS technology requires low power per MIP, which in turn minimizes power supply and cooling
needs. The chip architecture can be programmed for multiple word lengths, enabling operations to be
performed that would otherwise require much larger length processors. In combination these attributes
permit the extensive range of system performance.

Our new technology can be compared with a possible extension of the old technology it overlaps. It is
apparent that the advantages of smaller features have been used by processor designers to construct more
complex chips and by memory designers to provide greater replication of the simple element. If the trend
continues one could expect memories to get four times as large while processors might exploit density to:

1. include multiple execute units with instruction routers,

2. increase cache sizes and associative capability, and/or

3. increase instruction look ahead and advance computation capability.

However, these approaches to the old technology illustrated by FIGURE 1 all tend to dead end.
Duplicating processors leads to linearly increasing pin requirements, but pins per chip is fixed. Better
caching can only exploit the application's data reuse pattern. Beyond that, memory bandwidth becomes the limit.
Application data dependencies and branching limit the potential advantage of look ahead schemes.
Additionally, it is not apparent that MPP applications with fine-grained parallelism need 1, 4, or 16
Megaword memories per processing unit. Attempting to share such large memories between multiple
processors results in severe memory bandwidth limitations.

Our new approach is not dead ended. We combine significant memory, I/O and processor into
a single chip, as illustrated by FIGURE 2 and subsequent illustration and description. It reduces part
number requirements and eliminates the delays associated with chip crossing. More importantly, this
permits all the chip's I/O pins to be dedicated to interprocessor communication and thus maximizes
network bandwidth.

To implement our preferred embodiment illustrated in FIGURE 2 we use a process that is available now,
using IBM low power CMOS technology. Our illustrated embodiment can be made with CMOS DRAM
density, in CMOS, and can be implemented in denser CMOS. Our illustrated embodiment of 32K memory
cells for each of 8 PMEs on a chip can be increased as CMOS becomes denser. In our embodiment we
utilize the real estate and process technology for a 4 MEG CMOS DRAM, and expand this with processor
replication associated with 32K memory on the chip itself. The chip, it will be seen, has processor, memory,
and I/O in each of the chip packages of the cluster shown in FIGURE 3. Within each package is a memory
with embedded processor element, router, and I/O, all contained in a 4 MEG CMOS DRAM believed to be
the first general memory chip with extensive room for logic. It uses selected silicon with trench to provide
significant storage on a small chip surface. Each processor chip of our design alternatively can be made
with 16 replications of a 32K by 9 bit DRAM macro (35/80 ns) using 47 micron CMOS logic to make up the
memory array. The device is unique in that it allocates surface area for 120K cells of application logic on
the chip, supported by the capability of triple level metal wiring. The multiple cards of the old technology are
shown crossed out on the left side of FIGURE 3.

Our basic replicable element brick technology is an answer to the old technology. If one considered the
"Xed" technology on the left of FIGURE 3, one would see too many chips, too many cards, and waste. For
example, today's proposed teraflop machines that others offer would have literally a million or more chips in
them. With today's other technology only a few percent of these chips, at best, are truly operations
producers. The rest are "overhead" (typically memory, network interface, etc.).

It will become evident that it is not feasible to package such chips, in such a large number, in anything
that must operate in a constrained environment of physical size. (How many could you fit in a small area of
a cockpit?) Furthermore, such proposed teraflop machines of others, already huge, must scale up 1000
times to reach the petaop range. We have a solution which dramatically decreases the percent of
non-operations producing chips. We provide increased bandwidth. We provide this within a reasonable network
dimensionality. With such a brick technology, memory becomes the operator, networks are used
for passing controls, and the proportion of operations producing chips is dramatically increased. In addition,
the upgrade dramatically reduces the number of different types of chips. Our system is designed for scale-up,
without a requirement for specialized packaging, cooling, power, or environmental constraints.

With our brick technology, which uses memory units with built in processors and network capability
instead of separate processors, the configuration shown in FIGURE 3 represents a card with chips
which are pin compatible with current 4Mbit DRAM cards at the connector level. Such a single card could
hold, with a design point of a basic 40 mips per chip performance level, 32 chips, or 1280 mips. Four such
cards would provide 5 gips. The workstation configuration which is illustrated would preferably have such a
PE memory array, a cluster controller, and an IBM RISC System/6000 which has sufficient performance to
run and monitor execution of an array processor application developed at the workstation.
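
The card level figures follow directly from the per-processor rating. The sketch below is only an
illustration of that arithmetic, assuming the 5 MIPS per processor and 8 processors per chip stated
elsewhere in this description; the names are ours.

```python
# Illustrative arithmetic only; constant and function names are ours.
MIPS_PER_PME = 5       # each 16 bit processor is rated at 5 MIPS
PMES_PER_CHIP = 8
CHIPS_PER_CARD = 32


def chip_mips():
    """Aggregate MIPS of one chip."""
    return MIPS_PER_PME * PMES_PER_CHIP


def card_mips():
    """Aggregate MIPS of one 32 chip card."""
    return chip_mips() * CHIPS_PER_CARD


def system_mips(num_pmes):
    """Aggregate MIPS of a system with the given number of PMEs."""
    return MIPS_PER_PME * num_pmes


if __name__ == "__main__":
    print("chip:", chip_mips(), "MIPS")
    print("card:", card_mips(), "MIPS")
    print("four cards:", 4 * card_mips(), "MIPS")
    print("32,768 PMEs:", system_mips(32768), "MIPS")
```

Four cards hold 1024 PMEs and give 5120 MIPS, which is the 5 gips figure quoted above after rounding.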

A very gate efficient processor can be used in the processor portion. Such designs for processors have
been employed, but never within memory. Indeed, in addition, we have provided the ability to mix MIMD
and SIMD basic operation provisions. Our chip provides a "broadcast bus" which provides an alternate path
into each CPU's instruction buffer. Our cluster controller issues commands to each of the PEs in the PMEs,
and these can be stored in the PME to control their operation in one mode or another. Each PME does not
have to store an entire program, but can store only those portions applicable to a given task at various
times during processing of an application.

Given the basic device one can elect to develop a single processor memory combination. Alternatively,
by using a simpler processor and a subset of the memory macros one can design for either 2, 4, 8 or
16 replications of the basic processing element (PME). The PME can be made simpler either by adjusting
the dataflow bandwidth or by substituting processor cycles for functional accelerators. For most
embodiments we prefer to make 8 replications of the basic processing element we describe.

Our application studies have indicated that for now the most favorable answer is 8 replications of a 16
bit wide data flow and 32K word memory. We conclude this because:

1. 16 bit words permit single cycle fetch of instructions and addresses.

2. 8 PMEs each with an external port permits 4 dimensional torus interconnections. Using 4 or 8 PMEs
on each ring leads to modules suitable for the range of targeted system performances.

3. 8 external ports requires about 50% of the chip pins, providing sufficient remainder for power, ground
and common control signals.

4. 8 processors implemented in a 64 KByte Main Store:

a. allows for a register based architecture rather than a memory mapped architecture, and

b. forces some desirable but not required accelerators to be implemented by multiple processor
cycles.

This last attribute is important because it permits use of the developing logic density increase. Our new
accelerators (ex. floating point arithmetic unit per PME) are added as chip hardware without affecting
system design, pins and cables or application code.

The resultant chip layout and size (14.59 x 14.93 mm) is shown in FIGURE 2, and FIGURE 3 shows a
cluster of such chips, which can be packaged in systems like those shown in later FIGURES for stand alone
units, workstations which slide next to a workstation host with a connection bus, in AWACS applications, and
in supercomputers. This chip technology provides a number of system level advantages. It permits
development of the scalable MPP by basic replication of a single part type. The two DRAM macros per
processor provide sufficient storage for both data and program. An SRAM of equivalent size might consume
more than 10 times more power. This advantage permits MIMD machine models rather than the more
limited SIMD models characteristic of machines with single chip processor/memory designs. The 35 ns or
less DRAM access time matches the expected processor cycle time. CMOS logic provides the logic density
for a very effective PME and does so while dissipating only 1.3 watts. (Total chip power is 1.3 + .9
(memory) = 2.2 W.) Those features in turn permit using the chip in MIL applications requiring conduction
cooling. (Air cooling in non-MIL applications is significantly easier.) However, the air cooled embodiment can
be used for workstation and other environments. A standalone processor might be configured with an 80
amp, 5 volt power supply.

Advanced Parallel Array Processor (APAP) building blocks are shown in FIGURE 4 and in FIGURE 5.
FIGURE 4 illustrates the functional block diagram of the Advanced Parallel Array Processor. Multiple
application interfaces 150, 160, 170, 180 exist for the application processor 100 or processors 110, 120,
130. FIGURE 5 illustrates the basic building blocks that can be configured into different system block
diagrams. The APAP, in a maximum configuration, can incorporate 32,768 identical PMEs. The processor
consists of the PME Array 280, 290, 300, 310, an Array Director 250 and an Application Processor Interface
260 for the application processor 200 or processors 210, 220, 230. The Array Director 250 consists of three
functional units: Application Processor Interface 260, cluster Synchronizer 270 and cluster Controller 270.
An Array Director can perform the functions of the array controller of our prior linear picket system for SIMD
operations with MIMD capability. The cluster controller 270, along with a set of 64 Array clusters 280, 290,
300, 310 (i.e. clusters of 512 PMEs), is the basic building block of the APAP computer system. The
elements of the Array Director 250 permit configuring systems with a wide range of cluster replications. This
modularity based upon strict replication of both processing and control elements is unique to this massively
parallel computer system. In addition, the Application Processor Interface 260 supports the Test/Debug
device 240 which will accomplish important design, debug, and monitoring functions.

Controllers are assembled with a well-defined interface, e.g. IBM's MicroChannel, used in other systems
today, including controllers with i860 processors. Field programmable gate arrays add functions to the
controller which can be changed to meet a particular configuration's requirements (how many PMEs there
are, their couplings, etc.).

The PME arrays 280, 290, 300, 310 contain the functions needed to operate as either SIMD or MIMD
devices. They also contain functions that permit the complete set of PMEs to be divided into 1 to 256
distinct subsets. When divided into subsets the Array Director 250 interleaves between subsets. The
sequence of the interleave process and the amount of control exercised over each subset is program
controlled. This capability to operate distinct subsets of the array in one mode, i.e., MIMD with differing
programs, while other sets operate in tightly synchronized SIMD mode under Array Director control,
represents an advance in the art. Several examples presented later illustrate the advantages of the concept.

Array Architecture 

The set of nodes forming the Array is connected as an n-dimensional modified hypercube. In that
interconnection scheme, each node has direct connections to 2n other nodes. Those connections can be
either simplex, half-duplex or full-duplex type paths. In any dimension greater than 3d, the modified
hypercube is a new concept in interconnection techniques. (The modified hypercube in the 2d case
generates a torus, and in the 3d case an orthogonally connected lattice with edge surfaces wrapped to
the opposite surface.)

To describe the interconnection scheme for greater than 3d cases requires an inductive description. A
set of m_1 nodes can be interconnected as a ring. (The ring could be 'simply connected', 'braided', 'cross
connected', 'fully connected', etc. Although additional node ports are needed for greater than simple rings,
that added complexity does not affect the modified hypercube structure.) The m_2 rings can then be linked
together by connecting each equivalent node in the m_2 set of rings. The result at this point is a torus. To
construct an (i+1)d modified hypercube from an id modified hypercube, consider m_(i+1) sets of id modified
hypercubes and interconnect all of the equivalent m_i level nodes into rings.

This process is illustrated for the 4d modified hypercube, using m_i = 8 for i = 1..4, by the illustration in
FIGURE 6. Compare our description under Node Topology and also FIGURES 6, 9, 10, 15 and 16.
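
The inductive construction can be made concrete with a short sketch. This is our illustration, not
part of the disclosure: it assumes 'simply connected' rings in every dimension, represents a node by a
tuple of ring positions, and uses function names of our own choosing.

```python
from itertools import product


def neighbors(node, ring_sizes):
    """Return the 2n ring neighbours of `node` in an n-dimensional modified hypercube.

    `node` is a tuple of coordinates and `ring_sizes[i]` is the ring length m_i in
    dimension i. Rings are assumed to be simply connected (each node links to the
    next and previous node on the ring, with wrap-around).
    """
    result = []
    for dim, size in enumerate(ring_sizes):
        for step in (1, -1):
            coords = list(node)
            coords[dim] = (coords[dim] + step) % size
            result.append(tuple(coords))
    return result


def build(ring_sizes):
    """Map every node of the array to its list of directly connected nodes."""
    all_nodes = product(*(range(size) for size in ring_sizes))
    return {node: neighbors(node, ring_sizes) for node in all_nodes}


if __name__ == "__main__":
    array_4d = build((8, 8, 8, 8))
    print(len(array_4d), "nodes")
    print(len(set(array_4d[(0, 0, 0, 0)])), "distinct neighbours per node")
```

With m_i = 8 in four dimensions the sketch yields 4096 nodes, each with 2n = 8 distinct neighbours.
With only two dimensions it reduces to the torus mentioned above.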

FIGURE 6 illustrates the fine-grained parallel technology path from the single processor element 300,
made up of 32K 16-bit words with a 16-bit processor, to the Network node 310 of eight processors 312 and
their associated memory 311 with their fully distributed I/O routers 313 and Signal I/O ports 314, 315, on
through groups of nodes labeled clusters 320 and into the cluster configuration 360 and to the various
applications 330, 340, 350, 370. The 2d level structure is the cluster 320, and 64 clusters are integrated to
form the 4d modified hypercube of 32,768 Processing Elements 360.
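
The element counts at each level of this build-up are consistent with rings of eight nodes in every
dimension. The following sketch only restates that arithmetic; the ring size of 8 in all four
dimensions and the names used are our assumptions for illustration.

```python
# Illustrative arithmetic only; constant and function names are ours.
PMES_PER_NODE = 8   # one chip (node) carries eight PMEs
RING_SIZE = 8       # assumed ring length in every dimension


def nodes_at_level(dimensions):
    """Number of nodes in a modified hypercube of the given dimensionality."""
    return RING_SIZE ** dimensions


def pmes_at_level(dimensions):
    """Number of PMEs in a modified hypercube of the given dimensionality."""
    return nodes_at_level(dimensions) * PMES_PER_NODE


if __name__ == "__main__":
    print("cluster (2d):", nodes_at_level(2), "nodes,", pmes_at_level(2), "PMEs")
    print("clusters in 4d array:", nodes_at_level(4) // nodes_at_level(2))
    print("full 4d array:", pmes_at_level(4), "PMEs")
```

A 2d cluster therefore holds 64 nodes or 512 PMEs, and 64 such clusters give the 32,768 processing
elements of the 4d array.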


Processing Array Element (PME) Preferred Embodiment

As illustrated by FIGURE 2 and FIGURE 11, the preferred APAP has a basic building block of a one
chip node. Each node contains 8 identical processor memory elements (PMEs) and one broadcast and
control interface (BCI). While some of our inventions may be implemented when all functions are not on the
same chip, it is important from a performance and cost reduction standpoint to provide the chip as a one
chip node with the 8 processor memory elements using the advanced technology which we have described
and which can be implemented today.

The preferred implementation of a PME has a 64 KByte main store, 16 16-bit general registers on each
of 8 program interrupt levels, a full function arithmetic/logic unit (ALU) with working registers, a status
register, and four programmable bi-directional I/O ports. In addition the preferred implementation provides a
SIMD mode broadcast interface via the broadcast and control interface (BCI) which allows an external
controller (see our original parent application and the description of our currently preferred embodiment for
a nodal array and system with clusters) to drive PME operation decode, memory address, and ALU data
inputs. This chip can perform the functions of a microcomputer allowing multiple parallel operations to be
performed within it, and it can be coupled to other chips within a system of multiple nodes, whether by an
interconnection network, a mesh or hypercube network, or our preferred and advanced scalable
embodiment.

The PMEs are interconnected in a series of rings or tori in our preferred scalable embodiment. In some
applications the nodes could be interconnected in a mesh. In our preferred embodiment each node contains
two PMEs in each of four tori. The tori are denoted W, X, Y, and Z (see FIGURE 6). FIGURE 11 depicts the
interconnection of PMEs within a node. The two PMEs in each torus are designated by their external I/O
port (+W, -W, +X, -X, +Y, -Y, +Z, -Z). Within the node, there are also two rings which interconnect the 4
+n and 4 -n PMEs. These internal rings provide the path for messages to move between the external tori.

Since the APAP can be in our preferred embodiment a four dimensional orthogonal array, the internal
rings allow messages to move throughout the array in all dimensions.

The PMEs are self-contained stored program microcomputers comprising a main store, local store,
operation decode, arithmetic/logic unit (ALU), working registers and Input/Output (I/O) ports. The PMEs have
the capability of fetching and executing stored instructions from their own main store in MIMD operation or
to fetch and execute commands via the BCI interface in SIMD mode. This interface permits
intercommunication among the controller, the PME, and other PMEs in a system made up of multiple chips.

The BCI is the node's interface to the external array controller element and to an array director. The
BCI provides common node functions such as timers and clocks. The BCI provides broadcast function
masking for each nodal PME and provides the physical interface and buffering for the
broadcast-bus-to-PME data transfers, and also provides the nodal interface to system status and monitoring
and debug elements.

Each PME contains separate interrupt levels to support each of its point-to-point interfaces and the
broadcast interface. Data is input to the PME main store or output from PME main store under Direct
Memory Access (DMA) control. An "input transfer complete" interrupt is available for each of the interfaces
to signal the PME software that data is present. Status information is available for the software to determine
the completion of data output operations.

Each PME has a "circuit switched mode" of I/O in which one of its four input ports can be switched
directly to one of its four output ports, without having the data enter the PME main store. Selection of the
source and destination of the "circuit switch" is under control of the software executing on the PME. The
other three input ports continue to have access to PME main store functions, while the fourth input is
switched to an output port.
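
The behaviour of the circuit switched mode can be sketched as a small software model. This is a
hypothetical illustration only: the class and method names are ours, a Python list stands in for DMA
transfers into main store, and nothing here reflects the actual port hardware or its signalling.

```python
class PmePortsSketch:
    """Toy model of one PME's four input and four output ports."""

    NUM_PORTS = 4

    def __init__(self):
        self.main_store = []                                   # (input port, word) pairs
        self.outputs = {port: [] for port in range(self.NUM_PORTS)}
        self.circuit = None                                    # (input port, output port) or None

    def set_circuit(self, in_port, out_port):
        """Software selects the source and destination of the circuit switch."""
        if in_port not in self.outputs or out_port not in self.outputs:
            raise ValueError("port numbers must be 0..3")
        self.circuit = (in_port, out_port)

    def clear_circuit(self):
        """Return all four input ports to normal main store access."""
        self.circuit = None

    def receive(self, in_port, word):
        """Deliver a word arriving on `in_port`."""
        if self.circuit is not None and self.circuit[0] == in_port:
            # Circuit switched: the word bypasses main store entirely.
            self.outputs[self.circuit[1]].append(word)
        else:
            # Normal path: the word is stored in main store.
            self.main_store.append((in_port, word))


if __name__ == "__main__":
    pme = PmePortsSketch()
    pme.set_circuit(in_port=2, out_port=3)
    pme.receive(2, "passes through")
    pme.receive(0, "stored locally")
    print(pme.outputs[3], pme.main_store)
```

In this model only one circuit is active at a time, matching the description that the other three
input ports keep their access to main store.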

An additional type of I/O has data that must be broadcast to, or gathered from, all PMEs, plus data
which is too specialized to fit on the standard buses. Broadcast data can include SIMD commands, MIMD
programs, and SIMD data. Gathered data is primarily status and monitor functions. Diagnostic and test
functions are the specialized data elements. Each node, in addition to the included set of PMEs, contains
one BCI. During operations the BCI section monitors the broadcast interface and steers/collects broadcast
data to/from the addressed PME(s). A combination of enabling masks and addressing tags are used by the
BCI to determine what broadcast information is intended for which PMEs.
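
One way to picture the combination of enabling masks and addressing tags is the sketch below. The
formats of the mask and the tags are not given in this description, so everything here is assumed for
illustration: an 8 bit enable mask with one bit per nodal PME, one integer tag per PME, and class and
method names of our own.

```python
from dataclasses import dataclass, field


@dataclass
class NodalPmeSketch:
    tag: int                                   # hypothetical addressing tag for this PME
    received: list = field(default_factory=list)


class BciSketch:
    """Toy broadcast and control interface for one 8-PME node."""

    def __init__(self, tags):
        self.pmes = [NodalPmeSketch(tag) for tag in tags]
        self.enable_mask = 0xFF                # bit i set: PME i accepts broadcast data

    def broadcast(self, word, tag=None):
        """Steer `word` to enabled PMEs; if `tag` is given, only to PMEs with that tag.

        Returns the indices of the PMEs that received the word.
        """
        targets = []
        for index, pme in enumerate(self.pmes):
            enabled = (self.enable_mask >> index) & 1
            if enabled and (tag is None or pme.tag == tag):
                pme.received.append(word)
                targets.append(index)
        return targets


if __name__ == "__main__":
    bci = BciSketch(tags=[0, 0, 0, 0, 1, 1, 1, 1])
    bci.enable_mask = 0b00111100
    print(bci.broadcast("SIMD command"))       # all enabled PMEs
    print(bci.broadcast("program B", tag=1))   # enabled PMEs carrying tag 1
```

The point of the sketch is only that two independent selectors, a mask and a tag, together decide
which PMEs of a node see a given broadcast word.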

Each PME is capable of operating in SIMD or in MIMD mode in our preferred embodiment. In SIMD
mode, each instruction is fed into the PME from the broadcast bus via the BCI. The BCI buffers each
broadcast data word until all of its selected nodal PMEs have used it. This synchronization provides
accommodation of the data timing dependencies associated with the execution of SIMD commands and
allows asynchronous operations to be performed by a PME. In MIMD mode, each PME executes its own
program from its own main store. The PMEs are initialized to the SIMD mode. For MIMD operations, the
external controller normally broadcasts the program to each of the PMEs while they are in SIMD mode, and
then commands the PMEs to switch to MIMD mode and begin executing. Masking/tagging the broadcast
information allows different sets of PMEs to contain different MIMD programs, and/or selected sets of PMEs
to operate in MIMD mode while other sets of PMEs execute in SIMD mode. In various software clusters or
partitions these separate functions can operate independently of the actions in other clusters or partitions.
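
The start-up sequence for MIMD operation can be sketched as follows. This is a hypothetical model, not
the actual command set: the command names "LOAD" and "GO_MIMD", the class and function names, and the
choice to have a PME in MIMD mode ignore broadcast commands are all our assumptions.

```python
class PmeModeSketch:
    """Toy PME that starts in SIMD mode and can be switched to MIMD mode."""

    def __init__(self):
        self.mode = "SIMD"          # PMEs are initialized to the SIMD mode
        self.main_store = []        # program loaded by broadcast
        self.executed = []          # trace of executed operations

    def on_broadcast(self, command, payload=None):
        """Handle one word arriving from the broadcast bus."""
        if self.mode != "SIMD":
            return                  # assumption: broadcast commands are ignored in MIMD mode
        if command == "LOAD":
            self.main_store = list(payload)
        elif command == "GO_MIMD":
            self.mode = "MIMD"
        else:
            self.executed.append(command)   # ordinary SIMD command, executed directly

    def run(self):
        """In MIMD mode, execute the program held in the PME's own main store."""
        if self.mode == "MIMD":
            self.executed.extend(self.main_store)


def load_and_switch(pmes, assignments):
    """Broadcast a program to each selected set of PMEs, then switch those sets to MIMD.

    `assignments` is a list of (pme_indices, program) pairs; PMEs not listed stay in SIMD.
    """
    for indices, program in assignments:
        for index in indices:
            pmes[index].on_broadcast("LOAD", program)
    for indices, _ in assignments:
        for index in indices:
            pmes[index].on_broadcast("GO_MIMD")
            pmes[index].run()


if __name__ == "__main__":
    node = [PmeModeSketch() for _ in range(8)]
    load_and_switch(node, [((0, 1), ["a1", "a2"]), ((2,), ["b1"])])
    node[7].on_broadcast("ADD")
    print([pme.mode for pme in node])
```

After the call, PMEs 0 to 2 run their own differing programs while the remaining PMEs continue to
execute broadcast SIMD commands, which is the mixed operation described above.
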
The operation of the Instruction Set Architecture (ISA) of the PME will vary slightly depending on whether
the PME is in the SIMD or MIMD mode. Most ISA instructions operate identically regardless of mode.
However, since the PME in SIMD mode does not perform branching or other control functions, some code
points dedicated to those MIMD instructions are reinterpreted in SIMD mode to allow the PME to perform
special operations such as searching main memory for a match to a broadcast data value or switching to
MIMD mode. This further extends system flexibility of an array.

PME Architecture

Basically, our preferred architecture comprises a PME which has a 16 bit wide data flow, 32K of 16 bit
memory, specialized I/O ports and I/O switching paths, plus the necessary control logic to permit each PME
to fetch, decode and execute the 16 bit instruction set provided by our instruction set architecture (ISA).
The preferred PME performs the functions of a virtual router, and thus performs both the processing
functions and data router functions. The memory organization allows, by cross addressing of memory
between PMEs, access to a large random access memory, and direct memory for the PME. The individual
PME memory can be all local or divided into local and shared global areas programmatically. Specialized
controls and capabilities which we describe permit rapid task switching and retention of program state
information at each of the PME's interrupt execution levels. Although some of the capabilities we provide
have existed in other processors, their application for management of interprocessor I/O is unique in
massively parallel machines. An example is the integration of the message router function into the PME itself.
This eliminates specialized router chips or development of specialized VLSI routers. We also recognize that
in some instances one could distribute the functions we provide on a single chip onto several chips
interconnected by metallization layers or otherwise and accomplish improvements to massively parallel
machines. Further, as our architecture is scalable from a single node to massively parallel supercomputer
level machines, it is possible to utilize some of our concepts at different levels. As we will illustrate, for
example, our PME data flow is very powerful and yet operates to make the scalable design effective.

The PME processing memory element develops, for each of the multiple PMEs of a node, a fully
distributed architecture. Every PME will be comprised of processing capability with 16 bit data flow, 64K
bytes of local storage, store and forward/circuit switch logic, PME to PME communication, SIMD/MIMD
switching capabilities, programmable routing, and dedicated floating point assist logic. These functions can
be independently operated by the PME and integrated with other PMEs within the same chip to minimize
chip crossing delays. Referring to FIGURES 7 and 8 we illustrate the PME dataflow. The PME consists of
16 bit wide dataflow 425, 435, 445, 455, 465, 32K by 16 bit memory 420, specialized I/O ports 400, 410,
460, 430 and I/O switching paths 425, plus the necessary control logic to permit the PME to fetch, decode
and execute a 16 bit reduced instruction set 430, 440, 450, 480. The special logic also permits the PME to
perform as both the processing unit 460 and data router. Specialized controls 405, 406, 407, 408 and
capabilities are incorporated to permit rapid task switching and retention of program state information at
each of the PME's interrupt execution levels. Such capabilities have been included in other processors;
however, their application specifically for management of interprocessor I/O is unique in massively parallel
machines. Specifically, it permits the integration of the router function into the PME without requiring
specialized chips or VLSI development macros.

16 bit internal data flow and control

The major parts of the internal data flow of the processing element are shown in FIGURE 7. This
processing element has a full 16 bit internal data flow 425, 435, 445, 455, 465. The important paths of
the internal data flow use 12 nanosecond hard registers such as the OP register 450, M register 440,
WR register 470, and the program counter PC register 430. These registers feed the fully distributed
ALU 460 and I/O router registers and logic 405, 406, 407, 408 for all operations. With current VLSI
technology, the processor can execute memory operations and instruction steps at 25 MHz, and it can
build the important elements, OP register 450, M register 440, WR register 470, and the PC register 430
with 12 nanosecond hard registers. Other required registers are mapped to memory locations.

As seen in FIGURE 8 the internal data flow of the PME has its 32K by 16 bit main store in the form of
two DRAM macros. The remainder of the data flow consists of CMOS gate array macros. All of the memory
can be formed with the logic with low power CMOS DRAM deposition techniques to form a very large
scale integrated PME chip node. The PME is replicated 8 times in the preferred embodiment of the node
chip. The PME data flow consists of a 16 word by 16 bit general register stack, a multi-function
arithmetic/logic unit (ALU), working registers to buffer memory addresses, memory output registers, ALU
output registers, operation/command, I/O output registers and multiplexors to select inputs to the ALU and
registers. Current CMOS VLSI technology for 4MByte DRAM memory with our logic permits a PME to
execute instruction steps at 25 MHz. We are providing the OP register, the M register, the WR register and
the general register stack with 12 nanosecond hard registers. Other required registers are mapped to
memory locations within a PME.

The PME data flow is designed as a 16 bit integer arithmetic processor. Special multiplexor paths have
been added to optimize subroutine emulation of n x 16 bit floating point operations (n >= 1). The 16 bit data
flow permits effective emulation of floating point operations. Specific paths within the data flow have been
included to permit floating point operations in as little as 10 cycles. The ISA includes special code points to
permit subroutines for extended (longer than 16-bit) operand operations. The subsequent floating point
performance is approximately one twentieth the fixed point performance. This performance is
adequate to eliminate the need for special floating point chips augmenting the PME as is characteristic of
other massively parallel machines. Some other processors do include the special floating point processor
on the same chip as a single processor (see FIGURE 1). We can enable special floating point hardware
processors on the same chip with our PMEs but we would then need more cells than are required for the
preferred embodiment. For floating point operations, see also the concurrently filed FLOATING POINT
application referenced above for improvements to the IEEE standard.

The approach developed is well poised to take advantage of the normal increases in VLSI technology
performance. As circuit size shrinks and greater packaging density becomes possible, data flow
elements like base and index registers, currently mapped to memory, could be moved to hardware.
Likewise, floating point sub-steps are accelerated with additional hardware which we will prefer for the
developing CMOS DRAM technology as reliable higher density levels are achieved. Very importantly, this
hardware alternative does not affect any software.

The PME is initialized to SIMD mode with interrupts disabled. Commands are fed into the PME
operation decode buffer from the BCI. Each time an instruction operation completes, the PME requests a
new command from the BCI. In a similar manner, immediate data is requested from the BCI at the
appropriate point in the instruction execution cycle. Most instructions of the ISA operate identically whether
the PME is in SIMD mode or in MIMD mode, with the exception that SIMD instructions and immediate
data are taken from the BCI; in MIMD mode the PME maintains a program counter (PC) and uses that as
the address within its own memory to fetch a 16 bit instruction. Instructions such as "Branch" which
explicitly address the program counter have no meaning in SIMD mode and some of those code points are
reinterpreted to perform special SIMD functions such as comparing immediate data against an area of main store.

The PME instruction decode logic permits either SIMD or MIMD operational modes, and PMEs can
transition between modes dynamically. In SIMD mode the PME receives decoded instruction information
and executes that data in the next clock cycle. In MIMD mode the PME maintains a program counter PC
address and uses that as the address within its own memory to fetch a 16 bit instruction. Instruction decode
and execution proceeds as in most any other RISC type machine. A PME in SIMD mode enters MIMD
mode when given the information that represents a decode branch. A PME in MIMD mode enters the SIMD
mode upon executing a specific instruction for the transition.
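
The mode behavior described here, together with the "write control register" mechanism detailed in the
paragraph that follows, can be sketched in software. The following Python model is only an illustration
under stated assumptions: the class name, the bit position MIMD_BIT, and the method names are hypothetical
and are not taken from the PME ISA. It shows the one point that matters, namely that the instruction source
(BCI or local memory at the PC) is selected by a single control bit.

```python
from dataclasses import dataclass, field

MIMD_BIT = 0x0001  # hypothetical bit position; the text only says "the appropriate control bit"


@dataclass
class PME:
    """Toy model of how one PME selects its instruction source."""
    memory: list = field(default_factory=lambda: [0] * (32 * 1024))  # 32K words
    r0: int = 0                       # general register holding the MIMD start address
    pc: int = 0                       # program counter, used in MIMD mode only
    control: int = 0                  # MIMD_BIT clear means SIMD mode
    interrupts_enabled: bool = False  # the PME starts in SIMD mode with interrupts disabled

    @property
    def mode(self) -> str:
        return "MIMD" if self.control & MIMD_BIT else "SIMD"

    def write_control_register(self, value: int) -> None:
        was_mimd = bool(self.control & MIMD_BIT)
        self.control = value
        if (value & MIMD_BIT) and not was_mimd:
            # SIMD -> MIMD: enable interrupts and begin fetching at the address in the register.
            self.interrupts_enabled = True
            self.pc = self.r0
        # MIMD -> SIMD: nothing else is changed here, because the text does not
        # say what happens to the interrupt state on the way back.

    def next_instruction(self, bci_word: int) -> int:
        """Return the next 16 bit instruction, from the BCI or from local memory."""
        if self.mode == "SIMD":
            return bci_word & 0xFFFF
        word = self.memory[self.pc] & 0xFFFF
        self.pc = (self.pc + 1) % len(self.memory)
        return word
```

In this sketch a PME that has had a program loaded at address 100 would set r0 to 100, receive a
write_control_register(MIMD_BIT) command over the BCI, and from then on ignore BCI instruction words
until the bit is cleared again.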

When PMEs transition dynamically between SIMD and MIMD modes, MIMD mode is entered by
execution of a SIMD "write control register" instruction with the appropriate control bit set to a "1". At the
completion of the SIMD instruction, the PME enters the MIMD mode, enables interrupts, and begins
fetching and executing its MIMD instructions from the main store location specified by its general register
R0. Interrupts are masked or unmasked depending on the state of interrupt masks when the MIMD control
bit is set. The PME returns to SIMD mode either by being externally reinitialized or by executing a MIMD
"write control register" instruction with the appropriate control bit set to zero.

Data communication paths and control

Returning to FIGURE 7, it will be seen that each PME has 3 input ports 400, and 3 output ports 480
intended for on-chip communication plus 1 I/O port 410, for off chip communications. Existing
technology, rather than the processor idea, requires that the off-chip port be byte wide half duplex. Input
ports are connected such that data may be routed from input to memory, or from input AR register 406 to
output register 408 via direct 16 bit data path 425. Memory would be the data sink for messages targeted at
the PME or for messages that were moved in 'store and forward' mode. Messages that do not target the
particular PME are sent directly to the required output port, providing a 'circuit switched' mode, when
blocking has not occurred. The PME S/W is charged with performing the routing and determining the
selected transmission mode. This makes dynamically selecting between 'circuit switched' and 'store and
forward' modes possible. This is also another unique characteristic of the PME design.

Thus, our preferred node has 8 PMEs and each PME has 4 output ports (Left, Right, Vertical, and
External). Three of the input ports and three of the output ports are 16-bit wide full duplex point-to-point
connections to the other PMEs on the chip. The fourth ports are combined in the preferred embodiment to
provide a half duplex point-to-point connection to an off-chip PME. Due to pin and power constraints that we
have imposed to make use of the less dense CMOS we employ, the actual off-chip interface is a byte-wide
path which is used to multiplex two halves of the inter-PME data word. With special "zipper" circuitry which
provides a dynamic, temporary logical breaking of internodal rings to allow data to enter or leave an array,
these external PME ports provide the APAP external I/O array function.

For data routed to the PME memory, normal DMA is supported such that the PME instruction stream
must become involved in the I/O processing only at the beginning and end of messages. Finally, data that is
being circuit switched to an internal output port is forwarded without clocking. This permits single cycle
data transfers within a chip and detects when chip crossings will occur such that the fastest but still reliable
communication can occur. Fast forwarding utilizes forward data paths and backward control paths, all
operating in transparent mode. In essence, a source looks through several stages to see the
acknowledgments from the PME performing a DMA or off-chip transfer.
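
The byte-wide off-chip path that carries the two halves of a 16 bit inter-PME word can be illustrated with
a short Python sketch. The function names are hypothetical, and the order in which the halves are sent
(high byte first) is an assumption; the text does not state it.

```python
def split_word(word: int) -> tuple:
    """Split a 16 bit inter-PME word into two bytes for the byte-wide off-chip path."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("word must fit in 16 bits")
    # Assumption: the high-order half is transmitted first.
    return (word >> 8) & 0xFF, word & 0xFF


def join_bytes(high: int, low: int) -> int:
    """Reassemble the 16 bit word on the receiving chip."""
    return ((high & 0xFF) << 8) | (low & 0xFF)
```

Each off-chip word therefore needs two transfers on the external port, which is one reason the on-chip
16 bit paths are faster than a chip crossing.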

As seen in FIGURES 7 and 8, data on a PME input port may be destined for the local PME, or for a
PME further down the ring. Data destined for a PME further down the ring may be stored in the local PME
main memory and then forwarded by the local PME towards the target PME (store and forward), or the local
input port may be logically connected to a particular local output port (circuit switched) such that the data
passes "transparently" through the local PME on its way to the target PME. Local PME software
dynamically controls whether or not the local PME is in "store and forward" mode or in "circuit switched"
mode for any of the four inputs and four outputs. In circuit switched mode, the PME concurrently processes
all functions except the I/O associated with the circuit switch; in store and forward mode the PME suspends
all other processing functions to begin the I/O forwarding process.

While data may be stored externally of the array in a shared memory or DASD (with external controller),
it may be stored anywhere in the memories provided by PMEs. Input data destined for the local PME or
buffered in the local PME during "store and forward" operations is placed into local PME main memory via
a direct memory access (address) mechanism associated with each of the input ports. A program interrupt
is available to indicate that a message has been loaded into PME main memory. The local PME program
interprets header data to determine if the data destined for the local PME is a control message which can
be used to set up a circuit-switched path to another PME, or whether it is a message to be forwarded to
another PME. Circuit switched paths are controlled by local PME software. A circuit switched path logically
couples a PME input path directly to an output path without passing through any intervening buffer storage.
Since the output paths between PMEs on the same chip have no intervening buffer storage, data can enter
the chip, pass through a number of PMEs on the chip and be loaded into a target PME's main memory in a
single clock cycle! Only when a circuit switch combination leaves the chip is an intermediate buffer storage
required. This reduces the effective diameter of the APAP array by the number of unbuffered circuit switched
paths. As a result data can be sent from a PME to a target PME in as few clock cycles as there are
intervening chips, regardless of the number of PMEs in the path. This kind of routing can be compared to a
switched environment in which at each node cycles are required to carry data on to the next node. Each of
our nodes has 8 PMEs!
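
The reduction in effective diameter can be made concrete with a small counting model. This Python sketch
rests on a simplifying assumption that is ours, not the specification's: a circuit switched transfer costs
one clock cycle per chip traversed (buffering occurs only at chip crossings), while a conventional switched
path needs at least one buffered hop per PME-to-PME link. The function names are hypothetical.

```python
from itertools import groupby


def circuit_switched_cycles(path_chips):
    """Cycles for a circuit switched transfer.

    path_chips lists the chip id of every PME on the path, source first and
    target last. Consecutive PMEs on the same chip are crossed without
    buffering, so only the number of chips traversed is counted (assumption).
    """
    return sum(1 for _ in groupby(path_chips))


def buffered_hops(path_chips):
    """Hops for a path in which every PME-to-PME link is buffered."""
    return max(len(path_chips) - 1, 0)
```

For a path through six PMEs spread over three chips, for example [0, 0, 1, 1, 1, 2], the model counts
three cycles for the circuit switched case against five buffered hops.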

Memory and Interrupt Levels 


The PME contains 32K by 16 bit 420 dedicated storage words. This storage is completely general and
can contain both data and program. In SIMD operations all of memory could be data as is characteristic of
other SIMD massively parallel machines. In MIMD modes, the memory is quite normal; but, unlike most
massively parallel MIMD machines the memory is on the same chip with the PME and is thus immediately
available. This then eliminates the need for cache-tag and cache coherency techniques characteristic of
other massively parallel MIMD machines. In the case for instance of Inmos's chip, only 4K resides on the
chip, and external memory interface bus and pins are required. These are eliminated by us.

Low order storage locations are used to provide a set of general purpose registers for each interrupt
level. The particular ISA developed for the PME uses short address fields for these register references.
Interrupts are utilized to manage processing, I/O activities and S/W specified functions (i.e., a PME in
normal processing will switch to an interrupt level when incoming I/O initiates). If the level is not masked,
the switch is made by changing a pointer in H/W such that registers are accessed from a new section of
low order memory and by swapping a single PC value. This technique permits fast level switching and
permits S/W to avoid the normal register save operations and also to save status within the interrupt level
registers.

The PME processor operates on one of eight program interrupt levels. Memory addressing permits a
partitioning of the lower 576 words of memory among the eight levels of interrupts. 64 of these 576 words
of memory are directly addressable by programs executing on any of the eight levels. The other 512 words
are partitioned into eight 64 word segments. Each 64 word segment is directly accessible only by programs
executing on its associated interrupt level. Indirect addressing techniques are employed for allowing all
programs to access all 32K words of PME memory.
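
The partitioning of the lower 576 words can be illustrated with a short address calculation. This Python
sketch assumes a particular layout (the 64 shared words first, followed by the eight 64 word segments in
level order); the text gives the sizes but not the ordering, so the layout and the function name are
hypothetical.

```python
SHARED_WORDS = 64    # directly addressable from every interrupt level
SEGMENT_WORDS = 64   # private words per interrupt level
LEVELS = 8


def direct_address(level: int, offset: int, private: bool) -> int:
    """Map a short direct address to a word address in the lower 576 words.

    Assumed layout: words 0..63 are shared, and level L owns words
    64 + 64*L .. 127 + 64*L.
    """
    if not 0 <= level < LEVELS:
        raise ValueError("level must be 0..7")
    if not 0 <= offset < SEGMENT_WORDS:
        raise ValueError("offset must be 0..63")
    if not private:
        return offset
    return SHARED_WORDS + level * SEGMENT_WORDS + offset
```

Under this model a level switch changes only the value of level used in the calculation, which corresponds
to the hardware pointer change described above; no register contents have to be copied.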

The interrupt levels are assigned to the input ports, the BCI, and to error handling. There is a "normal"
level but there is no "privileged" nor "supervisor" level. A program interrupt causes a context switch in
which the contents of the PC program counter, status/control register, and selected general registers are
stored in specified main memory locations and new values for these registers are fetched from other
specified main memory locations.

The PME data flow discussed with reference to FIGURES 7 and 8 may be amplified by reference to
the additional sections below. In a complex system, the PME data flow uses the combination of the chip as
an array node with memory, processor and I/O which sends and receives messages with the BCI that we
replicate as the basic building block of an MMP built with our APAP. The MMP can handle many word
lengths.

PME Multiple Length Data Flow Processing

The system we describe can perform the operations handled by current processors with the data flow in
the PME which is 16 bits wide. This is done by performing operations on data lengths which are multiples
of 16 bits. This is accomplished by doing the operation in 16 bit pieces. One may need to know the result
of each piece (i.e. was it zero, was there a carry out of the high bit of the sum).

Adding two numbers of 48 bits can be an example of the data flow. In this example two numbers of 48 bits
( a(0-47) and b(0-47) ) are added by performing the following in the hardware:

a(32-47) + b(32-47) -> ans(32-47)    step one

1) save the carry out of high bit of sum
2) remember if partial result was zero or non-zero

a(16-31) + b(16-31) + saved carry -> ans(16-31)    step two

1) save the carry out of high bit of sum
2) remember if partial result was zero or non-zero from this result and from previous step; if both are
zero remember zero; if either is non-zero remember non-zero

a(0-15) + b(0-15) + saved carry -> ans(0-15)    final step

1) if this piece is zero and last piece was zero, ans is zero
2) if this piece is zero and last piece was non-zero, ans is non-zero
3) if this piece is non-zero, ans is positive or negative based on sign of sum (assuming no overflow)
4) if carry into sign of ans is not equal to carry out of sign of ans, ans has wrong sign and result is
an overflow (can not properly represent in the available bits)

The length can be extended by repeating the second step in the middle as many times as required. If the
length were 32 the second step would not be performed. If the length were greater than 48, step two would
be done multiple times. If the length were just 16 the operation in step one, with conditions 3 and 4 of the
final step, would be done. Extending the length of the operands to multiple lengths of the data flow is a
technique having a consequence that the instruction usually takes longer to execute for a narrower data
flow. That is, a 32 bit add on a 32 bit data flow only takes one pass through the adder logic, while the same
add on a 16 bit data flow takes two passes through the adder logic.

What we have done that is interesting is that in the current implementation of the machine we have
single instructions which can perform adds/subtracts/compares/moves on operands of length 1 to 8 words
(length is defined as part of the instruction). Individual instructions available to the programmer perform the
same kind of operations as shown above for steps one, two, and final (except to the programmer the
operand length is longer, i.e. 16 to 128 bits). At the bare bones hardware level, we are working on 16 bits at
a time, but the programmer thinks s/he's doing 16 to 128 bits at a time.

By using combinations of these instructions, operands of any length can be manipulated by the
programmer, i.e. two instructions can be used to add two numbers of up to 256 bits in length.
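
The piecewise addition described in this section can be sketched in Python. The sketch is an illustration
only: the function name and the returned tuple are hypothetical, the operands are assumed to be
two's-complement values supplied as lists of 16 bit words with the most significant word first (matching
the bit numbering above, where bits 0-15 are the high-order piece), and the hardware of course does not
use lists.

```python
MASK16 = 0xFFFF


def add_multiword(a_words, b_words):
    """Add two operands of equal length, 16 bits at a time.

    Returns (ans_words, is_zero, overflow). Words are most significant first.
    """
    if not a_words or len(a_words) != len(b_words):
        raise ValueError("operands must be non-empty and of equal length")
    ans = [0] * len(a_words)
    carry = 0              # carry saved from the previous (lower) piece
    all_zero = True        # running "every piece so far was zero" indication
    carry_into_sign = 0
    for i in range(len(a_words) - 1, -1, -1):  # least significant piece first
        a = a_words[i] & MASK16
        b = b_words[i] & MASK16
        if i == 0:
            # Final step: carry into the sign bit comes from the low 15 bits.
            carry_into_sign = ((a & 0x7FFF) + (b & 0x7FFF) + carry) >> 15
        total = a + b + carry
        ans[i] = total & MASK16
        carry = total >> 16                    # carry out of the high bit of this piece
        all_zero = all_zero and ans[i] == 0
    # Overflow when carry into the sign differs from carry out of the sign.
    overflow = carry_into_sign != carry
    return ans, all_zero, overflow
```

A 48 bit add uses three passes through the loop, a 32 bit add two, and a 16 bit add one, which mirrors the
way step two is repeated or omitted according to operand length.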

PME Processor 

Our PME processor is different from modern microprocessors currently utilized for MPP applications.
The processor portion differences include:

1. The processor is a fully programmable hardwired computer (see the ISA description for an instruction
set overview) with:
  o it has a complete memory module shown in the upper right corner (see FIGURE 8),
  o it has hardware registers with controls required to emulate separate register sets for each interrupt
    level (shown in upper left corner),
  o its ALU has the required registers and controls to permit effective multi-cycle integer and floating
    point arithmetic,
  o it has I/O switching paths needed to support packet or circuit switched data movement between
    PMEs interconnected by point-to-point links shown in the lower right corner,

2. This is our minimalist approach to processor design permitting eight replications of the PME per chip
with the CMOS DRAM technology,

3. This processor portion of the PME provides about the minimum dataflow width required to encode a
fast Instruction Set Architecture (ISA) - see Tables - which is required to permit effective MIMD or SIMD
operation of our MMP.

PME Resident Software 

The PME is the smallest element of the APAP capable of executing a stored program. It can execute a
program which is resident in some external control element and fed to it by the broadcast and control
interface (BCI) in SIMD mode or it can execute a program which is resident in its own main memory (MIMD
mode). It can dynamically switch between SIMD mode and MIMD mode representing SIMD/MIMD mode
duality functions, and also the system can execute these dualities at the same time (SIMIMD mode). A
particular PME can make this dynamic switch by merely setting or resetting a bit in a control register. Since
SIMD PME software is actually resident in the external control element, further discussion of this may be
found in our discussion of the Array Director and in related applications.

MIMD software is stored into the PME main memory while the PME is in SIMD mode. This is feasible
since many of the PMEs will contain identical programs because they will be processing similar data in an
asynchronous manner. Here we would note that these programs are not fixed, but they can be modified by
loading the MIMD program from an external source during processing of other operations.

Since the PME instruction set architecture represented in the Tables is that of a microcomputer, there
are few restrictions with this architecture on the functions which the PME can execute. Essentially each
PME can function like a RISC microprocessor. Typical MIMD PME software routines are listed below:

1. Basic control programs for dispatching and prioritizing the various resident routines.

2. Communication software to pass data and control messages among the PMEs. This software would
determine when a particular PME would go into/out of the "circuit switched" mode. It performs a "store
and forward" function as appropriate. It also initiates, sends, receives, and terminates messages between
its own main memory and that of another PME.

3. Interrupt handling software completes the context switch, and responds to an event which has caused
the interrupt. These can include fail-safe routines and rerouting or reassignment of PMEs to an array.

4. Routines which implement the extended Instruction Set Architecture which we describe below. These
routines perform macro level instructions such as extended precision fixed point arithmetic, floating point
arithmetic, vector arithmetic, and the like. This permits not only complex math to be handled but image
processing activities for display of image data in multiple dimensions (2d and 3d images) and multimedia
processes.

5. Standard mathematical library functions can be included. These can preferably include LINPACK and
VPSS routines. Since each PME may be operating on a different element of a vector or matrix, the
various PMEs may all be executing different routines or differing portions of the same matrix at one time.

6. Specialized routines for performing scatter/gather or sorting functions which take advantage of the
APAP nodal interconnection structure and permit dynamic multi-dimensional routing are provided. The
routines effectively take advantage of some amount of synchronization provided among the various
PMEs, while permitting asynchronous operations to continue. For sorts, there are sort routines. The
APAP is well suited to a Batcher Sort, because that sort requires extensive calculations to determine
particular elements to compare versus very short comparison cycles. Program synchronization is
managed by the I/O statements. The program allows multiple data elements per PME and very large parallel
sorts in quite a straight forward manner.

While each PME has its own resident software, the systems made from these microcomputers can
execute higher level language processes designed for scalar and parallel machines. Thus the system can
execute application programs which have been written for UNIX machines, or those of other operating
systems, in high level languages such as Fortran, C, C++, FortranD, and so on.

It may be an interesting footnote that our processor concepts use an approach to processor design
which is quite old. Perhaps thirty years of use of a similar ISA design has occurred in IBM's military
processors. We have been the first to recognize that this kind of design can be used to advantage to
leapfrog the dead ended current modern microprocessor design when combined with our total PME design
to move the technology to a new path for use in the next century.

Although the processor's design characteristics are quite different from other modern microprocessors,
similar gate constrained military and aerospace processors have used the design since the '60s. It provides
sufficient instructions and registers for straight forward compiler development and both general and signal
processing applications are effectively running on this design. Our design has minimal gate requirements,
and IBM has implemented some similar concepts for years when embedded chip designs were needed for
general purpose processing. Our adoption now of parts of the older ISA design permits use of many utilities
and other software vehicles which will enable adoption of our systems at a rapid rate because of the
existing base and the knowledge that many programmers have of the design concepts.

PME I/O

The PME will interface to the broadcast and control interface (BCI) bus by either reading data from the
bus into the ALU via the path labeled BCI in FIGURE 8 or by fetching instructions from the bus directly into
the decode logic (not shown). The PME powers up in SIMD mode and in that mode reads, decodes and
executes instructions until encountering a branch. A broadcast command SIMD mode causes the transition
to MIMD with instructions fetched locally. A broadcast PME instruction 'INTERNAL now reverts the state.

PME I/O can be sending data, passing data or receiving data. When sending data, the PME sets the
CTL register to connect XMIT to either L, R, V, or X. H/W services then pass a block of data from memory
to the target via the ALU multiplexer and the XMIT register. This processing interleaves with normal
instruction operation. Depending upon application requirements, the block of data transmitted can contain
raw data for a predefined PME and/or commands to establish paths. A PME that receives data will store
input to memory and interrupt the active lower level processing. The interpretation task at the interrupt level
can use the interrupt event to do task synchronization or initiate a transparent I/O operation (when data is
addressed elsewhere). During the transparent I/O operation, the PME is free to continue execution. Its CTL
register makes it a bridge. Data will pass through it without gating, and it will remain in that mode until an
instruction or the data stream resets CTL. While a PME is passing data it cannot be a data source, but it
can be a data sink for another message.

PME Broadcast Section 

This is a chip-to-common control device interface. It can be used by the device that serves as a
controller to command I/O, or test and diagnose the complete chip.

Input is word sequences (either instruction or data) that are available to subsets of PMEs. Associated
with each word is a code indicating which PMEs will use the word. The BCI will use the word both to
limit a PME's access to the interface and to assure that all required PMEs receive data. This serves to
adjust the BCI to the asynchronous PME operations. (Even when in SIMD mode PMEs are asynchronous
due to I/O and interrupt processing.) The mechanism permits PMEs to be formed into groups which are
controlled by interleaved sets of command/data words received over the BCI.

Besides delivering data to the PMEs, the BCI accepts request codes from the PMEs, combines them,
and transmits the integrated request. This mechanism can be used in several ways. MIMD processes can
be initiated in a group of processors that all end with an output signal. The "AND" of signals triggers the
controller to initiate a new process. Applications, in many cases, will not be able to load all required S/W in
PME memory. Encoded requests to the controller will be used to acquire a S/W overlay from perhaps the
host's storage system.

The controller uses a serial scan loop through many chips to acquire information on individual chips or
PMEs. These loops initially interface to the BCI but can in the BCI be bridged to individual PMEs.

Broadcast and Control Interface

The BCI broadcast and control interface provided on each chip provides a parallel input interface such
that data or instructions can be sent to the node. Incoming data is tagged with a subset identifier and the
BCI includes the functions required to assure that all PMEs within the node, operating within the subset, are
provided the data or instructions. The parallel interface of the BCI serves both as a port to permit data to be
broadcast to all PMEs and as the instruction interface during SIMD operations. Satisfying both requirements
plus extending those requirements to supporting subset operations is completely unique to this design
approach.

Our BCI parallel input interface permits data or instructions to be sent from a control element that is
external to the node. The BCI contains "group assignment" registers (see the grouping concepts in our
above application entitled GROUPING OF SIMD PICKETS) which are associated with each of the PMEs.
Incoming data words are tagged with a group identifier and the BCI includes the functions required to
assure that all PMEs within the node which are assigned to the dedicated group are provided the data or
instructions. The parallel interface of the BCI serves as both a port to permit data to be broadcast to the
PMEs during MIMD operations, and as the PME instruction/immediate operand interface during SIMD
operations.
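
The group-tagged delivery performed by the BCI can be sketched as follows. This Python model is a
simplified illustration: the class and method names are hypothetical, each PME is assumed to belong to
exactly one group at a time, and the per-PME inbox stands in for whatever buffering the hardware uses to
accommodate asynchronous PMEs.

```python
class BCI:
    """Toy model of the broadcast and control interface of one node."""

    def __init__(self, num_pmes: int = 8):
        # One "group assignment" register per PME; every PME starts in group 0.
        self.group_of = [0] * num_pmes
        # Words delivered to each PME and not yet consumed (modeling assumption).
        self.inbox = [[] for _ in range(num_pmes)]

    def assign(self, pme: int, group: int) -> None:
        """Load the group assignment register of one PME."""
        self.group_of[pme] = group

    def broadcast(self, group: int, word: int) -> list:
        """Deliver a tagged 16 bit word to every PME assigned to the tagged group."""
        targets = [p for p, g in enumerate(self.group_of) if g == group]
        for p in targets:
            self.inbox[p].append(word & 0xFFFF)
        return targets
```

Interleaving calls to broadcast with different group tags gives the interleaved command and data streams
mentioned above, with each group of PMEs seeing only its own words.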

The BCI also provides two serial interfaces. The high speed serial port will provide each PME with the
capability to output a limited amount of status information. That data is intended to:

1. signal our Array Director 610 when the PME, e.g. 500, has data that needs to be read or that the PME
has completed some operation. It passes a message to the external control element represented by the
Array Director.

2. provide activity status such that external test and monitor elements can illustrate the status of the
entire system.

The standard serial port provides the external control element a means for selectively accessing a specific
PME for monitor and control purposes. Data passed over this interface can direct data from the BCI parallel
interface to a particular PME register or can select data from a particular PME register and route it to the
high speed serial port. These control points allow the external control element to monitor and control
individual PMEs during initial power up and diagnostic phases. It permits the Array Director to input control
data so as to direct the port to particular PME and node internal registers and access points. These registers
provide paths such that PMEs of a node can output data to the Array Director, and these registers permit the
Array Director to input data to the units during initial power up and diagnostic phases. Data input to an access
point can be used to control test and diagnostic operations, i.e. perform single instruction step, stop on
compare, break points, etc.

Node Topology 

Our modified hypercube topology connection is most useful for massively parallel systems, while other
less powerful connections can be used with our basic PMEs. Within our initial embodiment of the VLSI chip
are eight PMEs with fully distributed PME internal hardware connections. The internal PME to PME chip
configuration is two rings of four PMEs, with each PME also having one connection to a PME in the other
ring. For the case of eight PMEs in a VLSI chip this is a three dimensional binary hypercube; however, our
approach in general does not use hypercube organizations within the chip. Each PME also provides for the
escape of one bus. In the initial embodiment the escaped buses from one ring are called +X, +Y, +W and
+Z, while those from the other ring are labeled similarly except - (minus).
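
The statement that two rings of four PMEs with one cross connection per PME form a three dimensional
binary hypercube can be checked mechanically. The Python sketch below is an illustration only; the
(ring, position) numbering and the Gray-code labeling are our own choices, made so that every on-chip link
joins two labels that differ in exactly one bit.

```python
GRAY = [0b00, 0b01, 0b11, 0b10]  # Gray code around a ring of four positions


def neighbors(ring: int, pos: int) -> list:
    """On-chip neighbors of a PME: two on its own ring, one on the other ring."""
    return [(ring, (pos - 1) % 4), (ring, (pos + 1) % 4), (1 - ring, pos)]


def label(ring: int, pos: int) -> int:
    """3 bit hypercube label: the ring selects bit 2, the ring position bits 1-0."""
    return (ring << 2) | GRAY[pos]


def is_three_cube() -> bool:
    """True if every on-chip link connects labels at Hamming distance one."""
    for ring in range(2):
        for pos in range(4):
            for other in neighbors(ring, pos):
                diff = label(ring, pos) ^ label(*other)
                if bin(diff).count("1") != 1:
                    return False
    return True
```

Because the eight labels are distinct and every PME has three distinct neighbors at Hamming distance one,
the on-chip network has exactly the edges of the 3-cube; the escape buses then supply the links to other
nodes.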

The specific chip organization is referred to as the node of the array, and a node can be in a cluster of
the array. The nodes are connected using +-X and +-Y into an array, to create a cluster. The
dimensionality of the array is arbitrary, and in general greater than two which is the condition required for
developing a binary hypercube. The clusters are then connected using +-W, +-Z into an array of clusters.
Again, the dimensionality of the array is arbitrary. The result is the 4-dimensional hypercube of nodes. The
extension to a 5-dimensional hypercube requires the usage of a 10 PME node and uses the additional two
buses, say +-E1, to connect the 4-dimensional hypercube into a vector of hypercubes. We have then shown
the pattern of extension to either odd or even radix hypercubes. This modified topology limits the cluster to
cluster wiring while supporting the advantages of the hypercube connection.

Our wireability and topology configuration for massively parallel machines has advantages in keeping
the X and Y dimensions within our cluster level of packaging, and in distributing the W and Z bus
connections to all the neighboring clusters. After implementing the techniques described, the product will be
wireable and manufacturable while maintaining the inherent characteristics of the topology defined.

The node consists of k*n PMEs plus the Broadcast and Control Interface (BCI) section. Here "n"
represents the number of dimensions or rings, which characterize the modified hypercube, while "k"
represents the number of rings that characterize the node. Although a node can contain k rings, it is a
characteristic of the concept that only two of those rings may provide escape buses. "n" and "k" are limited
in our preferred embodiment, because of the physical chip package, to n=4 and k=2. This limitation is a
physical one, and different chip sets will allow other and increased dimensionality in the array. In addition
to being a part of the physical chip package, it is our preferred embodiment to provide a grouping of PMEs
that interconnect a set of rings in a modified hypercube. Each node will have 8 PMEs with their PME
architecture and ability to perform processing and data router functions. As such, n is the dimensionality of
the modified hypercube (see following section), i.e., a 4d modified hypercube's node element would be 8
PMEs while a 5d modified hypercube's node would be 10 PMEs. For visualization of nodes which we can
employ, refer to FIGURE 6, as well as FIGURES 9 and 10 for visualization of interconnections, and see
FIGURE 11 for a block diagram of each node. FIGURES 16 and 10 elaborate on possible interconnections
for an APAP.

It will be noted that the application entitled "METHOD FOR INTERCONNECTING AND SYSTEM OF
INTERCONNECTED PROCESSING ELEMENTS" of co-inventor David B. Rolfe, filed in the United States
Patent and Trademark Office on May 13, 1991, under USSN 07/698,866, described the modified hypercube
criteria which can preferably be used in connection with our APAP MMP.

That application is incorporated by reference and describes a method of interconnecting processing
elements in such a way that the number of connections per element can be balanced against the network
diameter (worst case path length). This is done by creating a topology that maintains many of the well
known and desirable topological properties of hypercubes while improving its flexibility by enumerating the
nodes of the network in number systems whose base can be varied. When using a base 2 number system
this method creates the hypercube topology. The invention has fewer interconnections than a hypercube,
has uniform connections, and preserves the properties of a hypercube. These properties include: 1) large
number of alternate paths, 2) very high aggregate bandwidth, and 3) well understood and existing methods
that can be used to map other common problem topologies with the topology of the network. The result is a
generalized non-binary hypercube with less density. It will be understood that with the preference we have
given to the modified hypercube approach, in some applications a conventional hypercube can be utilized.
In connecting nodes, other approaches to a topology could be used; however, the ones we describe herein
are believed to be new and an advance, and we prefer the ones we describe.

The interconnection methods for the modified hypercube topology for interconnecting a plurality of
nodes in a network of PMEs:

1. defines a set of integers e1, e2, e3, ..., such that the product of all elements equals the number of PMEs
in the network called M, while the product of all elements in the set excepting e1 and e2 is the number
of nodes called N, and the number of elements in the set called m defines the dimensionality of the
network n by the relationship n = m-2,

2. addresses a PME located by a set of indexes a1, a2, ..., am, where each index is the PME's position in
the equivalent level of expansion such that the index ai is in the range of zero to ei-1 for i equal to 1, 2,
..., m, by the formula

(...((a(m)*e(m-1) + a(m-1))*e(m-2) + ... + a(2))*e(1) + a(1)

where the notation a(i) has the normal meaning of the ith element in the list of elements called a, or
equivalently for e,

3. connects two PMEs (with addresses r and s) if and only if either of the following two conditions hold:

a. the integer part of r/(e1*e2) equals the integer part of s/(e1*e2) and:

1. the remainder part of r/e1 differs from the remainder part of s/e1 by 1 or,

2. the remainder part of r/e2 differs from the remainder part of s/e2 by 1 or e2-1.

b. the remainder part of r/ei differs from the remainder part of s/ei for i in the range 3, 4, ..., m and the
remainder part of r/e1 equals the remainder part of s/e1 which equals i minus three, and the
remainder part of r/e2 differs from the remainder part of s/e2 by e2 minus one.
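
By way of illustration only, the addressing step (step 2) can be sketched in Python, reading the formula as a
nested (Horner style) evaluation from a(m) down to a(1). The function name pme_address and the example
extents (e1 = 2, e2 = 4 for the PMEs within a node, and 8 for each of the four array dimensions) are
assumptions chosen to match the preferred embodiment described in this section; they are not taken from the
referenced application.

```python
from math import prod


def pme_address(indexes, extents):
    """Mixed-radix address of a PME.

    indexes = [a1, ..., am], extents = [e1, ..., em], with 0 <= ai <= ei - 1.
    Evaluates (...((a(m)*e(m-1) + a(m-1))*e(m-2) + ...)*e(1) + a(1).
    """
    if len(indexes) != len(extents):
        raise ValueError("need exactly one index per extent")
    for a, e in zip(indexes, extents):
        if not 0 <= a < e:
            raise ValueError("index out of range for its extent")
    address = 0
    # Start from the most significant index a(m) and work down to a(1).
    for a, e in zip(reversed(indexes), reversed(extents)):
        address = address * e + a
    return address


# Assumed example: 2 x 4 PMEs per node, four array dimensions of extent 8.
extents = [2, 4, 8, 8, 8, 8]
M = prod(extents)        # number of PMEs in the network
N = prod(extents[2:])    # number of nodes (all extents except e1 and e2)
n = len(extents) - 2     # dimensionality, n = m - 2
```

With these assumed extents the sketch gives M = 32768 PMEs, N = 4096 nodes and n = 4, and the addresses run
from 0 to M-1 without gaps.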
As a result the computing system nodes will form a non-binary hypercube, with the potential for being
different radix in each dimension. The node is defined as an array of PMEs which supports 2*n ports such
that the ports provided by nodes match the dimensionality requirements of the modified hypercube. If the
set of integers e3, e4, ..., em, which define the specific extent of each dimension of a particular modified
hypercube, are all taken as equal, say b, and if e1 and e2 are taken as 1, then the previous formulas for
addressability and connections reduce to:

2. addressing a PME as numbers representing the base b numbering system

3. connecting two computing elements (f and g) if and only if the address of f differs from the address of
g in exactly one base b digit, using the rule that 0 and b-1 are separated by 1.

4. the number of connections supported by each PME is 2*n

Which is exactly as described in the base application, with the number of communication buses spanning
non-adjacent PMEs chosen as zero.
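
The reduced connection rule can be sketched as follows. This is a minimal illustration under the stated
simplification (all dimensions of equal extent b, one PME per node); the helper names to_digits,
are_connected and neighbors are hypothetical and do not come from the base application.

```python
def to_digits(address, base, ndigits):
    """Base-`base` digits of `address`, least significant digit first."""
    digits = []
    for _ in range(ndigits):
        address, d = divmod(address, base)
        digits.append(d)
    return digits


def are_connected(f, g, base, ndigits):
    """True if f and g differ in exactly one digit, and by 1 modulo `base`.

    The modulo makes digit 0 and digit base-1 count as separated by 1.
    """
    df = to_digits(f, base, ndigits)
    dg = to_digits(g, base, ndigits)
    diffs = [(a - b) % base for a, b in zip(df, dg) if a != b]
    return len(diffs) == 1 and diffs[0] in (1, base - 1)


def neighbors(f, base, ndigits):
    """All addresses connected to f under the rule above."""
    result = set()
    for i in range(ndigits):
        for step in (1, -1):
            digits = to_digits(f, base, ndigits)
            digits[i] = (digits[i] + step) % base
            g = sum(d * base**k for k, d in enumerate(digits))
            if g != f:
                result.add(g)
    return sorted(result)
```

For base 8 and four digits each element has 8 neighbors (two per dimension). For base 2 the +1 and -1 steps
coincide, so each element has one neighbor per dimension, which is the conventional binary hypercube.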

Intra-Node PME Interconnections:

PMEs are configured within the node as a 2 by n array. Each PME is interconnected with its three
neighbors (edges wrap only in the second dimension) using a set of input/output ports, thus providing full
duplex communication capability between PMEs. Each PME's external input and output port is connected to
node I/O pins. Input and output ports may be connected to share pins for half-duplex communication or to
separate pins for full-duplex capability. The interconnections for a 4d modified hypercube node are shown
in FIGURES 9 and 10. (Note that where n is even the node can be considered to be a 2 by 2 by n/2 array.)
FIGURE 9 illustrates the eight processing elements 500, 510, 520, 530, 540, 550, 560, 570 within
the node. The PMEs are connected in a binary hypercube communication network. This binary hypercube
displays three intra connections between PMEs (501, 511, 521, 531, 541, 551, 561, 571, 590, 591, 592, 593).
Communication between the PMEs is controlled by in and out registers under control of the processing
element. This diagram shows the various paths that can be taken to escape I/O out any of the eight
directions, +-w 525, 565, +-x 515, 555, +-y 505, 545, +-z 535, 575. The communication can be
accomplished without storing the data into memory if desired.
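
The 2 by n arrangement can be sketched as below. This is an illustrative model only: the (ring, position)
coordinates and the function names are assumptions introduced here, not reference numerals from FIGURE 9.

```python
def intra_node_neighbors(ring, pos, n=4):
    """Three on-chip neighbors of the PME at (ring, pos) in a 2 by n node.

    The position wraps around within a ring; the ring index (0 or 1) does not
    wrap, it only pairs each PME with its counterpart in the other ring.
    """
    return [
        (ring, (pos - 1) % n),   # previous PME on the same ring
        (ring, (pos + 1) % n),   # next PME on the same ring
        (1 - ring, pos),         # counterpart PME on the other ring
    ]


def intra_node_links(n=4):
    """Set of undirected PME-to-PME links inside one node."""
    links = set()
    for ring in (0, 1):
        for pos in range(n):
            for nb in intra_node_neighbors(ring, pos, n):
                links.add(frozenset([(ring, pos), nb]))
    return links
```

For n = 4 this yields 12 links (four per ring plus four between the rings), which is the edge count of a three
dimensional binary cube and agrees with the twelve intra connections listed for FIGURE 9.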

It may be noted that while a network switch chip could be employed to connect various cards each
having our chip with other chips of the system, it can and should desirably be eliminated. Our inter PME
network that we describe as the "4d torus" is the mechanism used for inter PME communication. A PME
can reach any other PME in the array on this interface. (PMEs in between may be Store/Forward or Circuit
Switched.)

Chip Relationships for Interconnections

We have discussed the chip, and FIGURE 11 shows a block diagram of the PME Processor/Memory
chip. The chip consists of the following elements each of which will be described in later paragraphs:

1. 8 PMEs each consisting of a 16 bit programmable processor and 32K words of memory (64K bytes),

2. Broadcast Interface (BCI) to permit a controller to operate all or subsets of the PMEs and to
accumulate PME requests,

3. Interconnection Levels

a. Each PME supports four 8 bit wide inter-PME communication paths. These connect to 3
neighboring PMEs on the chip and 1 off chip PME.

b. Broadcast-to-PME busing, which makes data or instructions available.

c. Service Request lines that permit any PME to send a code to the controller. The BCI combines the
requests and forwards a summary.

d. Serial Service loops permit the controller to read all detail about the functional blocks. This level of
interconnection extends from the BCI to all PMEs (FIGURE 11 for ease of presentation omits this
detail.)

Interconnection and Routing

The MPP will be implemented by replication of the PME. The degree of replication does not affect the
interconnection and routing schemes used. FIGURE 6 provides an overview of the network interconnection
scheme. The chip contains 8 PMEs with interconnections to their immediate neighbors. This interconnection
pattern results in the three dimensional cube structure shown in FIGURE 10. Each of the processors within
the cube has a dedicated bidirectional byte port to the chip's pins; we refer to the set of 8 PMEs as a node.

An n by n array of nodes is a cluster. Simple bridging between the + and - x ports and the + and - y
ports provides the cluster node interconnections. Here our preferred chip or node has 8 PMEs, each of
which manages a single external port. This distributes the network control function and eliminates a possible
bottleneck for ports. Bridging the outer edges makes the cluster into a logical torus. We have considered
clusters with n=4 and n=8 and believe that n=8 is the better solution for commercial applications while
n=4 is better for military conduction cooled applications. Our concept does not impose an unchangeable
cluster size. On the contrary, we anticipate some applications using variations.

An array of clusters results in the 4 dimensional torus or hypercube structure illustrated in FIGURE 10.
Bridging between the + and - w ports and + and - z ports provides the 4d torus interconnections. This
results in each node within a cluster connected to an equivalent node in all adjacent clusters. (This provides
64 ports between two adjacent clusters rather than the 8 ports that would result from larger clusters.) As
with the cluster size, the scheme does not imply a particular size array. We have considered 2x1 arrays
desirable for workstations and MIL applications and 4x4, 4x8 and 8x8 arrays for mainframe applications.

Developing an array of 4d toruses is beyond the gate, pin, and connector limitations of our current
preferred chip. However, that limitation disappears when our alternative on-chip optical driver/receiver is
employed. In this embodiment our network could use an optical path per PME; with 12 rather than 8 PMEs
per chip, the array of 4d toruses has multi-Tflop (teraflop) performance, and it seems to be economically
feasible to make such machines available for the workstation environment. Remember that such alternative
machines will use the application programs developed for our current preferred embodiment.

4d Cluster Organization

For constructing a 4d modified hypercube 360, as illustrated in FIGURES 6 and 10, nodes supporting 8
external ports 315 are required. Consider the external ports to be labeled as +X, +Y, +Z, +W, -X, -Y, -Z,
-W. Then using m1 nodes, a ring can be constructed by connecting the +X to -X ports. Again m2 such
rings can be interconnected into a ring of rings by interconnecting the matching +Y to -Y ports. This level
of structure will be called a cluster 320. With m1 = m2 = 8 it provides for 512 PMEs and such a cluster will
be a building block for several size systems (330, 340, 350), as illustrated with m=8 in FIGURE 6.

4d Array Organization

For building large fine-grained systems, sets of m3 clusters are interconnected in rows using the +Z to
-Z ports. The m4 rows are then interconnected using the +W to -W ports. For m1 = ... = m4 = 8 this results
in a system with 32768 (8^5) PMEs. The organization does not require that every dimension be equally
populated, as shown in FIGURE 6 (large fine-grained parallel processor 370). In the case of the fine-grained
small processor, only a cluster might be used with the unused Z and W ports being interconnected on the
card. This technique saves card connector pins and makes possible the application of this scalable
processor to workstations 340, 350 and avionics applications 330, both of which are connector pin limited.
Connecting +/- ports together in the Z and W pairs leads to a workstation organization that permits debug,
test and large machine software development.

Again, much smaller scale versions of the structure can be developed by generating the structure with a
value smaller than m = 8. This will permit construction of single card processors compatible with the
requirements for accelerators in the PS/2 or RISC System 6000 workstation 350.
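
The sizes quoted for the cluster and array organizations follow from simple products, sketched below. The
function names are hypothetical; the only inputs taken from the description are 8 PMEs per node and the ring
lengths m1 through m4.

```python
PMES_PER_NODE = 8  # preferred embodiment: eight PMEs per chip (node)


def cluster_pmes(m1, m2, pmes_per_node=PMES_PER_NODE):
    """PMEs in a cluster built as a ring (length m2) of rings (length m1) of nodes."""
    return m1 * m2 * pmes_per_node


def system_pmes(m1, m2, m3, m4, pmes_per_node=PMES_PER_NODE):
    """PMEs in an m3 by m4 array of clusters."""
    return cluster_pmes(m1, m2, pmes_per_node) * m3 * m4


def ports_between_adjacent_clusters(m1, m2):
    """Each node links to its equivalent node in the adjacent cluster: one port per node."""
    return m1 * m2
```

With m1 = m2 = 8 a cluster holds 512 PMEs and presents 64 ports to each adjacent cluster; with all four ring
lengths equal to 8 the system holds 32768 PMEs. Smaller values of m give the single card configurations
mentioned above, for example 128 PMEs for m1 = m2 = 4.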

I/O Performance

I/O performance includes overhead to set up transfers and actual burst rate data movement. Setup
overhead depends upon application function, I/O complexity and network contention. For example, an
application can program circuit switched traffic with buffering to resolve conflicts or it can have all PMEs
transmit left and synchronize. In the first case, I/O is a major task and detailed analysis would be used to
size it. We estimate that simple case setup overhead is 20 to 30 clock cycles or 0.8 to 1.2 u-sec.

Burst rate I/O is the maximum rate a PME can transfer data (with either an on or off chip neighbor).
Memory access limits set the data rate at 140 nsec per byte, corresponding to 7.14 Mbyte/s. This
performance includes buffer address and count processing plus data read/write. It uses seven 40ns cycles
per 16 bit word transferred.

This burst rate performance corresponds to a cluster having a maximum potential transfer rate of 3.65
Gbytes/s. It also means that a set of eight nodes along a row or column of the cluster will achieve 57
Mbyte/s burst data rate using one set of their 8 available ports. This number is significant because I/O with
the external world will be done by logically 'unzipping' an edge of the wrapped cluster and attaching it to
the external system bus.
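
The I/O figures in this section can be checked with a short calculation. The sketch below assumes a 40 ns
cycle, seven cycles per 16 bit word and 512 PMEs per cluster, as stated in the description; the function
names are illustrative only.

```python
CYCLE_NS = 40          # clock cycle time in nanoseconds
CYCLES_PER_WORD = 7    # cycles used per word transferred
BYTES_PER_WORD = 2     # 16 bit word


def ns_per_byte():
    """Transfer time per byte in nanoseconds."""
    return CYCLES_PER_WORD * CYCLE_NS / BYTES_PER_WORD


def burst_mbytes_per_s():
    """Burst rate of a single PME port in Mbyte/s (1 Mbyte = 10**6 bytes)."""
    return 1000.0 / ns_per_byte()


def aggregate_mbytes_per_s(ports):
    """Combined burst rate of `ports` PME ports running in parallel."""
    return ports * burst_mbytes_per_s()


def setup_overhead_us(cycles):
    """Setup overhead in microseconds for a given number of clock cycles."""
    return cycles * CYCLE_NS / 1000.0
```

Under these assumptions a port moves one byte every 140 ns (about 7.14 Mbyte/s), eight ports along one edge
of a cluster give about 57 Mbyte/s, 512 PMEs give roughly 3.66 Gbyte/s, and 20 to 30 setup cycles take 0.8
to 1.2 microseconds.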

Inter-PME Routing Protocol

The SIMD/MIMD PME comprises interprocessor communication to the external I/O facilities, broadcast
control interfaces, and switching features which allow both SIMD/MIMD operation within the same PME.
Embedded in the PME is the fully distributed programmable I/O router for processor communication and
data transfers between PMEs.

The PMEs have fully distributed interprocessor communication hardware to on-chip PMEs as well as to
the external I/O facilities which connect to the interconnected PMEs in the modified hypercube
configuration. This hardware is complemented with the flexible programmability of the PME to control the
I/O activity via software. The programmable I/O router functions provide for generating data packets and
packet addresses. With this information the PME can send the information through the network of PMEs in a
directed method or out multiple paths determined by any fault tolerance requirements.

Distributed fault tolerance algorithms or program algorithms can take advantage of the programmability
along with the supported circuit switched modes of the PME. This combinational mode enables everything
from offline PMEs to optimal path data structures to be accomplished via the programmable I/O router.

Our study of applications reveals that it is sometimes most efficient to send bare data between PMEs.
At other times applications require data and routing information. Further, it is sometimes possible to plan
communications so that network conflicts cannot occur; other applications offer the potential for deadlock,
unless mechanisms for buffering messages at intermediate nodes are provided. Two examples illustrate the
extremes. In the relaxation phase of a PDE solution, each grid point can be allocated to a node. The inner
loop process of acquiring data from a neighbor can easily be synchronized over all nodes. Alternatively,
image transformations use local data parameters to determine communication target or source identifiers.
This results in data moves through multiple PMEs, and each PME becomes involved in the routing task for
each packet. Preplanning such traffic is generally not possible.

To enable the network to be efficient for all types of transfer requirements, we partition, between the
H/W and S/W, the responsibility for data routing between PMEs. S/W does most of the task sequencing
function. We added special features to the hardware (H/W) to do the inner loop transfers and minimize
software overhead on the outer loops.

I/O programs at dedicated interrupt levels manage the network. For most applications, a PME dedicates
four interrupt levels to receiving data from the four neighbors. We open a buffer at each level by loading
registers at the level, and executing the IN (it uses buffer address and transfer count but does not await
data receipt) and RETURN instruction pair. Hardware then accepts words from the particular input bus and
stores them to the buffer. The buffer full condition will then generate the interrupt and restore the program
counter to the instruction after the RETURN. This approach to interrupt levels permits I/O programs to be
written that do not need to test what caused the interrupt. Programs read data, return, and then continue
directly into processing the data they read. Transfer overhead is minimized as most situations require little
or no register saving. Where an application uses synchronization on I/O, as in the PDE example, then
programs can be used to provide that capability.

Write operations can be started in several ways. For the PDE example, at the point where a result is to
be sent to a neighbor, the application level program executes a write call. The call provides buffer location,
word count and target address. The write subroutine includes the register loads and OUT instructions
needed to initiate the H/W and return to the application. H/W does the actual byte by byte data transfer.
More complicated output requirements will use an output service function at the highest interrupt level. Both
application and interrupt level tasks access that service via a soft interrupt.

Setting up circuit switched paths builds on these simple read and write operations. We start with all
PMEs having open buffers sized to accept packet headers but not the data. A PME needing to send data
initiates the transfer by sending an address/data block to a neighboring PME whose address better matches
the target. In the neighboring PME address information will be stored; due to buffer full an interrupt will
occur. The interrupt S/W tests the target address and will either extend the buffer to accept the data or write
the target address to an output port and set the CTL register for transparent data movement. (This allows
the PME to overlap its application executions with the circuit switched bridging operation.) The CTL register
goes to busy state and remains transparent until reset by the presence of a signal at end of data stream or
abnormally by PME programming. Any number of variations on these themes can be implemented.
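
The header handling just described can be modeled in a few lines. This is a deliberately simplified sketch and
not the PME instruction sequence: it assumes a one dimensional chain of PMEs, reduces the CTL register to a
single transparent flag, and uses hypothetical names (PME, send_circuit_switched).

```python
from dataclasses import dataclass, field


@dataclass
class PME:
    address: int
    transparent: bool = False                     # stands in for the CTL register state
    received: list = field(default_factory=list)  # data accepted by this PME


def send_circuit_switched(pmes, source, target, data):
    """Pass a header from `source` toward `target` along a 1-D chain of PMEs.

    Each intermediate PME forwards the header and goes transparent; the target
    extends its buffer and accepts the data. The end-of-data signal then resets
    every transparent PME. Returns the positions of the bridging PMEs.
    """
    if source == target:
        raise ValueError("source and target must differ")
    step = 1 if target > source else -1
    bridged = []
    pos = source + step
    while pmes[pos].address != target:
        pmes[pos].transparent = True    # forward the header, bridge the data
        bridged.append(pos)
        pos += step
    pmes[pos].received.extend(data)     # target accepts the data block
    for p in bridged:                   # end of data stream resets the bridges
        pmes[p].transparent = False
    return bridged
```

For example, with six PMEs addressed 0 through 5, sending from 0 to 3 bridges through PMEs 1 and 2 and
delivers the data to PME 3, after which no PME remains transparent.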

System I/O and Array Director

FIGURE 12 shows an Array Director in the preferred embodiment which may perform the functions of
the controller of FIGURE 13 which describes the system bus to array connections. FIGURE 13 is composed
of two parts, (a) the bus to/from a cluster, and part (b) the communication of information on the bus to/from
a PME. Loading or unloading the array is done by connecting the edges of clusters to the system bus.
Multiple system buses can be supported with multiple clusters. Each cluster supports 50 to 57 Mbyte/s
bandwidth. Loading or unloading the parallel array requires moving data between all or a subset of the
PMEs and standard buses (i.e., MicroChannel, VME-bus, FutureBus, etc.). Those buses, part of the host
processor or array controller, are assumed to be rigidly specified. The PME Array therefore must be
adapted to the buses. The PME Array can be matched to the bandwidth of any bus by interleaving bus data
onto n PMEs, with n picked to permit PMEs both I/O and processing time. FIGURE 13 shows how we might
connect the system buses to the PMEs at two edges of a cluster. Such an approach would permit 114
Mbyte/s to be supported. It also permits data to be loaded at half the peak rate to two edges
simultaneously. Although this reduces the bandwidth to 57 Mbyte/s/cluster, it has the advantage of providing
orthogonal data movement within the array and ability to pass data between two buses. (We use those
advantages to provide fast transpose and matrix multiply operation.)

As shown in part (a) of FIGURE 13, the bus "dots" to all paths on the edges of the cluster; and, the
controller generates a gate signal to each path in the required interleave timing. If required to connect to a
system bus with greater than 57 Mbyte/s, the data will be interleaved over multiple clusters. For example, in
a system requiring 200 Mbyte/s system buses, groups of 2 or 4 clusters will be used. A large MPP has the
capacity to attach 16 or 64 such buses to its xy network paths. By using the w and z paths in addition to the
x and y paths, that number could be doubled.
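
The grouping of clusters and the interleave gating can be illustrated as follows. The sketch assumes the
per-cluster rates quoted in this section (57 Mbyte/s on one edge, 114 Mbyte/s when two edges are used) and a
plain round-robin gate order; the function names are hypothetical.

```python
import math


def clusters_needed(bus_mbytes_per_s, per_cluster_mbytes_per_s=57):
    """Smallest number of clusters whose combined rate covers the bus rate."""
    return math.ceil(bus_mbytes_per_s / per_cluster_mbytes_per_s)


def interleave(words, n_paths):
    """Gate successive bus words onto n_paths in round-robin order.

    Returns one list per path; word i goes to path i % n_paths.
    """
    paths = [[] for _ in range(n_paths)]
    for i, word in enumerate(words):
        paths[i % n_paths].append(word)
    return paths
```

For a 200 Mbyte/s bus this gives 4 clusters at 57 Mbyte/s each, or 2 clusters at 114 Mbyte/s each, which is
consistent with the groups of 2 or 4 clusters mentioned above.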

FIGURE 13 part (b) shows how the data routes to individual PMEs. The FIGURE shows one particular
w, x, y or z path that can be operated at 7.13 Mbyte/s in burst mode. If the data on the system bus occurred
in bursts, and if the PME memory could contain the complete burst, then only one PME would be required.
We designed the PME I/O structure to require neither of these conditions. Data can be gated into
PMEx0 at the full rate until buffer full occurs. At that instant PMEx0 will change to transparent and PMEx1
will begin accepting the data. Within PMEx0 processing of the input data buffer can begin. PMEs that have
taken data and processed it are limited because they cannot transmit the results while in the transparent
mode. The design resolves this by switching the data stream to the opposite end of the path at intervals.
FIGURE 13(b) shows that under S/W control one might dedicate PMEx0 through PMEx3 to accepting data
while PMEx12 through PMEx15 unload results and vice versa. The controller counts words and adds end of
block signals to the data stream, causing the switch in direction. One count applies to all paths supported
by the controller so controller workload is reasonable.
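
The buffer-full handoff along one path can be sketched as below. It is a simplified model only: each PME is
reduced to a fixed-size buffer, and the name fill_path and the buffer size used in the example are assumptions.

```python
def fill_path(stream, n_pmes, buffer_words):
    """Distribute a data stream over successive PMEs on one path.

    Words are gated into the first PME until its buffer is full; that PME then
    becomes transparent and the next PME on the path begins accepting data.
    """
    buffers = [[] for _ in range(n_pmes)]
    current = 0
    for word in stream:
        if len(buffers[current]) == buffer_words:
            current += 1                 # buffer full: hand over to the next PME
            if current == n_pmes:
                raise OverflowError("stream exceeds the capacity of the path")
        buffers[current].append(word)
    return buffers
```

For instance, ten words gated onto four PMEs with four-word buffers fill the first two PMEs completely and
leave two words in the third, so the first PME can begin processing while later words are still arriving.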

SYSTEMS FOR ALTERNATIVE COMPUTERS 

FIGURE 15 illustrates a system block diagram for a host attached large system with a single application
processor interface (API). This illustration may also be viewed with the understanding that our invention may
be employed in stand alone systems which use multiple application processor interfaces (not shown). This
configuration will support DASD/Graphics on all or many clusters. Workstation accelerators can eliminate the
host, application processor interface (API) and cluster synchronizer (CS) illustrated by emulation. The CS
is not always required. It will depend on the type of processing that is being performed, as well as the
physical drive or power provided for a particular application which uses our invention. An application that is
doing mostly MIMD processing will not place as high a workload demand on the controller, so here the
control bus can see very slow pulse rise times. Conversely, a system doing mostly asynchronous A-SIMD
operations with many independent groupings may require faster control busing. In this case, a cluster
synchronizer will be desirable.

The system block diagram of FIGURE 18 illustrates that a system might consist of host, array controller
and PME array. The PME array is a set of clusters supported by a set of cluster controllers (CC). Although
a CC is shown for each cluster that relationship is not strictly required. The actual ratio of clusters to CCs is
flexible. The CC provides redrive to, and accumulation from, the 64 BCIs/cluster. Therefore, physical
parameters can be used to establish the maximum ratio. Additionally, the CC will provide for controlling
multiple independent subsets of the PME array; that service might also become a gating requirement. A
study can be made to determine these requirements for any particular application of our invention. Two
versions of the CC will be used. A cluster that is to be connected to a system bus requires the CC providing
interleave controls (see System I/O and FIGURE 16) and tri-state drivers. A simpler version that omits
the tri-state busing features can also be employed. In the case of large systems, a second stage of redrive
and accumulation is added. This level is the cluster synchronizer (CS). The set of CCs plus CS and the
Application Processor Interface (API) make up the Array Controller. Only the API is a programmable unit.

Several variations of this system synthesis scheme will be used. These result in different hardware
configurations for various applications, but they do not have a major impact on the supporting software.

For a workstation accelerator, the cluster controllers will be attached directly to the workstation system
bus; the function of the API will be performed by the workstation. In the case of a RISC/6000, the system
bus is a Micro Channel and the CC units can plug directly into the slots within the workstation. This
configuration places the I/O devices (DASD, SCSI and display interfaces) on the same bus that
loads/unloads the array. As such the parallel array can be used for I/O intensive tasks like real time image
generation or processing. For workstations using other bus systems (VME-bus, FutureBus, etc.), a gateway
interface will be used. Such modules are readily available in the commercial marketplace. Note that in these
minimal scale systems a single CC can be shared between a determined number of clusters, and neither a
CS nor an API is needed.

A MIL avionics application might be similar in size to a workstation, but it needs different interfacing.
Consider what may become the normal military situation. An existing platform must be enhanced with
additional processing capability, but funding prohibits a complete processing system redesign. For this we
would attach to the APAP array a smart memory coprocessor. In this case, a special application program
interface (API) that appears to the host as memory will be provided. Data addressed to the host's memory
will then be moved to the array via CC(s). Subsequent writes to memory can be detected and interpreted as
commands by the API so that the accelerator appears to be a memory mapped coprocessor.

Large systems can be developed as either host attached or as stand alone configurations. For a host
attached system, the configuration shown in FIGURE 18 is useful. The host will be responsible for I/O, and
the API would serve as a dispatched task manager. However, a large stand alone system is also possible in
special situations. For example, a database search system might eliminate the host, attach DASD to the
MicroChannels of every cluster and use multiple APIs as bus masters slaved to the PMEs.

Zipper Array Interface with External I/O

Our zipper provides a fast I/O connection scheme and is accomplished by placing a switch between two
nodes of the array. This switch will allow for the parallel communication into and out of the array. The fast
I/O will be implemented along one edge of the array rings and acts like a large zipper into the X, Y, W, Z
rings. The name "zipper connection" is given to the fast I/O. Allowing data to be transferred into and out of
the network while only adding switch delays to transfer the data between processors is a unique loading
technique. The switching scheme does not disrupt the ring topology created by the X, Y, W, Z buses and
special support hardware allows the zipper operation to occur while the PE is processing or routing data.

The ability to bring data into and out of a massively parallel system rapidly is an important
enhancement to the performance of the overall system. We believe that the way we implement our fast I/O
without reducing the number of processors or dimension of the array network is unique in the field of
massively parallel environments.

The modified hypercube arrangement can be extended to permit a topology which comprises rings
within rings. To support the interface to the external I/O any or all of the rings can be logically broken. The
two ends of the broken ring can then be connected to external I/O buses. Breaking the rings is a logical
operation so as to permit regular inter-PME communication at certain time intervals while permitting I/O at
other time intervals. This process of breaking a level of rings within the modified hypercube effectively
'unzips' rings for I/O purposes. The fast I/O "zipper" provides a separate interface into the array. This zipper
may exist on 1 to n edges of the modified hypercube and could support either parallel input into multiple
dimensions of the array or broadcast to multiple dimensions of the array. Further data transfers into or out
of the array could alternate between the two nodes directly attached to the zipper. This I/O approach is
unique and it permits developing different zipper sizes to satisfy particular application requirements. For
example, in the particular configuration shown in FIGURE 6, called the large fine-grained processor 369, the
zipper for the Z and W buses will be dotted onto the MCA bus. This approach optimizes the matrix
transposition time, satisfying a particular application requirement for the processor. For a more detailed
explanation of the "zipper" structure, reference may be had to the APAP I/O ZIPPER application filed
concurrently herewith. The zipper is here illustrated by FIGURE 14.

Depending on the configuration and the need of the program to roll data and program into and out of
the individual processing elements, the size of the zipper can be varied. The actual speed of the I/O zipper
is approximately the number of rings attached times the PME bus width, times the PME clock rate, all
divided by 2. (The division permits the receiving PME time to move data onward. Since it can send it to any
of n places, I/O contention is completely absorbed over the array.) With existing technology, i.e. a 5 MB/sec
PME transfer rate, 64 rings on the zipper, and transfers interleaved to two nodes, 320 MB/sec array transfer
rates are possible. (See the typical zipper configuration in FIGURE 14.) FIGURE 14 illustrates the fast I/O or
the so-called "zipper connection" 700, 710 which exists as a separate interface into the array. This zipper
may exist on one 700 or two edges 700, 710 of the hypercube network by dotting onto the broadcast bus
720, 730, 740, 750, at multiple nodes in the array 751, 752, 753, 754 and in multiple directions 770, 780,
790, 7$1, 756, 757.
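
The zipper rate estimate above can be checked with a short calculation. This is only a sketch under two assumptions that are not stated in the text: that the quoted 5 MB/sec PME transfer rate already equals bus width times clock rate, and that interleaving transfers across two nodes doubles the sustained rate. The function and parameter names are illustrative.

```python
def zipper_rate_mb_per_s(rings, pme_rate_mb_per_s, interleave_nodes=1):
    """Approximate zipper throughput in MB/sec.

    rings             -- number of rings attached to the zipper
    pme_rate_mb_per_s -- per-PME transfer rate (bus width x clock rate)
    interleave_nodes  -- nodes the transfers alternate between (assumed
                         to scale the rate linearly)
    The division by 2 leaves the receiving PME time to move data onward.
    """
    return rings * pme_rate_mb_per_s * interleave_nodes / 2


if __name__ == "__main__":
    # Figures quoted in the text: 64 rings, 5 MB/sec, two-node interleave.
    print(zipper_rate_mb_per_s(64, 5, interleave_nodes=2))
    # Same zipper without interleaving.
    print(zipper_rate_mb_per_s(64, 5))
```

Under these assumptions the interleaved case reproduces the 320 MB/sec figure, and the non-interleaved case gives 160 MB/sec.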

Today's MCA bus supports 80 to 160 MB per second burst transfer rate and therefore is a good match
for a single zipper in simple or non-interleaved mode. The actual transfer rate given channel overhead and
efficiency is something less than that. For systems that have even more demanding I/O requirements,
multiple zippers and MCA buses can be utilized. These techniques are seen to be important to processors
that would support a large external storage associated with nodes or clusters, as might be characteristic of
database machines. Such I/O growth capability is completely unique to this machine and has not previously
been incorporated in either massively parallel, conventional single processor, or coarse-grained parallel
machines.

Array Director Architecture

Our massively parallel system is made up of nodal building blocks of multiprocessor nodes, clusters of
nodes, and arrays of PMEs already packaged in clusters. For control of these packaged systems we
provide a system array director which with the hardware controllers performs the overall Processing
Memory Element (PME) Array Controller functions in the massively parallel processing environment. The
Director comprises three functional areas, the Application Interface, the Cluster Synchronizer, and
normally a Cluster Controller. The Array Director will have the overall control of the PME array, using the
broadcast bus and our zipper connection to steer data and commands to all of the PMEs. The Array
Director functions as a software system interacting with the hardware to perform the role of the shell of the
operating system. The Array Director in performing this role receives commands from the application
interface and issues the appropriate array instructions and hardware sequences to accomplish the
designated task. The Array Director's main function is to continuously feed the instructions to the PMEs and
route data in optimal sequences to keep the traffic at a maximum and collisions to a minimum.

The APAP computer system shown in FIGURE 6 is illustrated in more detail in the diagram of FIGURE
12 which illustrates the Array Director which can function as a controller, or array controller, as illustrated in
FIGURE 13 and FIGURES 18 and 19. This Array Director 610 illustrated in FIGURE 12 is shown in the
preferred embodiment of an APAP in a typical configuration of n identical array clusters 665, 670, 680, 690,
with an array director 610 for the clusters of 512 PMEs, and an application processor interface 630 for the
application processor or processors 600. The synchronizer 650 provides the needed sequences to the array
or cluster controller 640 and together they make up the "Array Director" 610. The application processor
interface 630 will provide the support for the host processor 600 or processors and test/debug workstations.
For APAP units attached to one or more hosts, the Array Director serves as the interface between the user
and the array of PMEs. For APAPs functioning as stand alone parallel processing machines, the Array
Director becomes the host unit and accordingly becomes involved in unit I/O activities.

The Array Director will consist of the following four functional areas (see the functional block diagram in
FIGURE 12):

1. Application Processor Interface (API) 630,

2. Cluster Synchronizer (CS) 650 (8 x 8 array of clusters),

3. Cluster Controller (CC) 640 (8 x 1 array of nodes),

4. Fast I/O (zipper connection) 620.

The Application Processor Interface (API) 630:

When operating in attached modes, one API will be used for each host. That API will monitor the
incoming data stream to determine what are instructions to the Array clusters 665, 670, 680, 690 and what
are data for the Fast I/O (zipper) 620. When in standalone mode, the API serves as the primary user
program host.

To support these various requirements, the APIs contain the only processors within the Array Director,
plus the dedicated storage for the API program and commands. Instructions received from the host can call
for execution of API subroutines, loading of API memory with additional functions, or for loading of CC and
PME memory with new S/W. As described in the S/W overview section, these various types of requests can be
restricted to a subset of users via the initial programs loaded into the API. Thus, the operating program
loaded will determine the type of support provided, which can be tailored to match the performance
capability of the API. This further permits the APAP to be adjusted to the needs of multiple users requiring
managed and well tested services, or to the individual user wishing to obtain peak performance on a
particular application.

The API also provides for managing the path to and from the I/O zipper. Data received from the host
system in attached modes, or from devices in standalone modes, is forwarded to the Array. Prior to initiating
this type of operation the PMEs within the Array which will be managing the I/O are initiated. PMEs
operating in MIMD mode can utilize the fast interrupt capability and either standard S/W or special functions
for this transfer, while those operating in SIMD modes would have to be provided detailed control
instructions. Data being sent from the I/O zipper requires somewhat the opposite conditioning. PMEs
operating in MIMD modes must signal the API via the high speed serial interface and await a response from
the API, while PMEs in SIMD modes are already in synchronization with the API and can therefore
immediately output data. The ability to system switch between modes provides a unique ability to adjust the
program to the application.

Cluster Synchronizer (CS) 650

The CS 650 provides the bridge between the API 630 and CC 640. It stores API 630 output in FIFO
stacks and monitors the status being returned from the CC 640 (both parallel input acknowledges and high
speed serial bus data) to provide the CC, in timely fashion, with the desired routines or operations that need
to be started. The CS provides the capability to support different CCs and different PMEs within clusters so
as to permit dividing the array into subsets. This is done by partitioning the array and then commanding the
involved cluster controllers to selectively forward the desired operation. The primary function of the
synchronizer is to keep all clusters operating and organized such that overhead time is minimized or buried
under the covers of PME execution time. We have described how the use of the cluster synchronizer in
A-SIMD configurations is especially desirable.

Cluster Controller (CC) 640

The CC 640 interfaces to the node Broadcast and Control Interface (BCI) 605 for the set of nodes in an
array cluster 665. (For a 4d modified hypercube with 8 nodes per ring that means the CC 640 is attached to
64 BCIs 605 in an 8 by 8 array of nodes and is controlling 512 PMEs. Sixty-four such clusters, also in an 8
by 8 array, lead to the full up system with 32768 PMEs.) The CC 640 will send commands and data
supplied by the CS 650 to the BCI parallel port and return the acknowledgement data to the CS 650 when
operating in MIMD modes. In SIMD mode the interface operates synchronously, and step by step
acknowledgments are not required. The CC 640 also manages and monitors the high speed serial port to
determine when PMEs within the nodes are requesting services. Such requests are passed upward to the
CS 650 while the raw data from the high speed serial interface is made available to the status display
interface. The CC 640 provides the CS 650 with an interface to specific nodes within the cluster via the
standard speed serial interface.
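
The cluster and system sizes quoted above follow from simple multiplication. The sketch below reproduces them, assuming eight PMEs per node (as stated for the chip) and eight nodes per ring. The constant and function names are illustrative only.

```python
PMES_PER_NODE = 8    # eight 16-bit processors per chip (node)
NODES_PER_RING = 8   # assumed ring length for the 4d modified hypercube


def cluster_size(nodes_per_ring=NODES_PER_RING, pmes_per_node=PMES_PER_NODE):
    """Return (nodes, PMEs) for one cluster: an N x N array of nodes."""
    nodes = nodes_per_ring * nodes_per_ring
    return nodes, nodes * pmes_per_node


def system_size(clusters_per_side=8):
    """Return (clusters, PMEs) for a square array of clusters."""
    _, pmes_per_cluster = cluster_size()
    clusters = clusters_per_side * clusters_per_side
    return clusters, clusters * pmes_per_cluster


if __name__ == "__main__":
    print(cluster_size())  # nodes (one BCI each) and PMEs per cluster controller
    print(system_size())   # clusters and PMEs in the full-up system
```

With these values one cluster controller serves 64 BCIs and 512 PMEs, and 64 clusters give 32768 PMEs.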

In SIMD mode the CC will be directed to send instructions or addresses to all the PMEs over the
broadcast bus. The CC can dispatch a 16 bit instruction to all PMEs every 40 nanoseconds when in SIMD
mode. By broadcasting groups of native instructions to the PME, the emulated instruction set is formed.

When in MIMD mode the CC will wait for the endop signal before issuing new instructions to the PMEs.
The concept of the MIMD mode is to build strings of micro-routines with native instructions resident in the
PME. These strings can be grouped together to form the emulated instructions, and these emulated
instructions can be combined to produce service/canned routines or library functions.

When in SIMD/MIMD (SIMIMD) mode, the CC will issue instructions as if in SIMD mode and check for
endop signals from certain PMEs. The PMEs that are in MIMD will not respond to the broadcast instructions
and will continue with their designated operation. The unique status indicators will help the CC to manage
this operation and determine when and to whom to present the sequential instructions.
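
The SIMIMD behaviour described above can be illustrated with a small simulation. This is a minimal sketch, not the controller's actual logic: the `PME` class, its fields, and the rule that a PME rejoins the broadcast stream immediately after raising endop are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PME:
    pme_id: int
    mode: str = "SIMD"       # "SIMD" or "MIMD"
    remaining: int = 0       # steps left in the local MIMD routine
    executed: list = field(default_factory=list)

    def step(self, broadcast):
        """Advance one controller cycle; return True if endop is raised."""
        if self.mode == "MIMD":
            # Ignore the broadcast and continue the local routine.
            self.remaining -= 1
            self.executed.append("local")
            if self.remaining == 0:
                self.mode = "SIMD"   # assumed: rejoin the broadcast stream
                return True
            return False
        self.executed.append(broadcast)
        return False


def run_simimd(pmes, instructions):
    """Broadcast each instruction and collect (cycle, pme_id) endop events."""
    endops = []
    for cycle, instr in enumerate(instructions):
        for pme in pmes:
            if pme.step(instr):
                endops.append((cycle, pme.pme_id))
    return endops


pmes = [PME(0), PME(1, mode="MIMD", remaining=2), PME(2)]
endops = run_simimd(pmes, ["ADD", "SHIFT", "STORE"])
print(endops)
print(pmes[1].executed)
```

In this run PME 1 skips the first two broadcast instructions, signals endop in cycle 1, and executes only the third broadcast instruction.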

Operational Software Levels

This application overviews the operational software (S/W) levels to provide further explanation of the
services performed by various hardware (H/W) components.

Computer systems generally used have an operating system. Operating system kernels which are
relatively complete must be provided in most massive MIMD machines, where workstation class CPU chips
run kernels such as Mach. The operating system kernel supports message passing or memory coherency.
Other massively parallel systems based upon SIMD models have almost no intelligence in the array. There
are no "program counters" out in the array, and thus no programs to execute locally. All instructions are
broadcast.

In the systems we have provided with our PME as the basis for cluster arrays, there is no need for an
operating system at each chip, a node. We provide a library of key functions for computation and/or
communication within each PE (PME) that can be invoked at a high level. SIMD-like instructions are
broadcast to the array to set each of a selected set of PMEs. These PMEs can then perform in full MIMD
mode one or more of these library routines. In addition, basic interrupt handler and communications routines
are resident in each PME, allowing the PME to handle communication on a dynamic basis. Unlike existing
MIMD machines, the APAP structure need not include an entire program in PME memory. Instead all of that
code, which is essentially serial, is in the cluster controller. Thus such code, 80% by space and 10% by time
(typically), can be broadcast in a SIMD fashion to an array of PMEs. Only the truly parallel inner loops are
distributed to the PMEs in a dynamic fashion. These are then initiated into MIMD mode just as other
"library" routines are. This enables use of program models which are Single Program Multiple Data,
where the same program is loaded in each PME node, with embedded synchronization code, and
executed at the local PME. Design parameters affect bandwidth available on different links, and the system
paths are programmatically configurable, allowing high bandwidth links on a target network, and allowing
dynamic partition of off-chip links such as PME-to-PME links to provide more bandwidth on specific paths as
meets the needs of a particular application. The links leaving a chip mate directly with each other, without the
need for external logic. There are sufficient links and there is no predesigned constraint as to which other links
they can attach to, so that the system can have a diversity of interconnection topologies, with routing
performed dynamically and programmatically.

The system allows usage of existing compilers and parsers to create an executable parallel program
which could run on a host or workstation based configuration. Sequential source code for a Single Program
Multiple Data system would pass through program analysis, for examination of dependency, data and
controls, enabling extension of program source to include call graphs, dependency tables, aliases, usage
tables and the like.

Thereafter, program transformation would occur whereby a modified version of the program would be
created that extends the degree of parallelism by combining sequences or recognizing patterns to
generate explicit compiler directives. A next step would be a data allocation and partitioning step, with
message operations, which would analyze data usage patterns and allocate so that elements to be combined
would share common indexing and addressing patterns, and these would provide embedded program compiler
directives and calls to communication services. At this point the program would pass to a level partitioning
step. A level partitioning step would separate the program into portions for execution in ARRAY, in ARRAY
CONTROLLER (array director or cluster controller), and HOST. Array portions would be interleaved in
sections with any required message passing synchronization functions. At this point level processing could
proceed. Host sources would pass to a level compiler (FORTRAN) for assembly compilation, controller
sources would pass to a microprocessor controller compiler, and items that would be needed by a single
PME and not available in a library call would pass to a parser (FORTRAN or C) to an intermediate level
language representation which would generate optimized PME code and Array Controller code. PME code
would be created at PME machine level and would include library extensions, which would pass on load
into a PME memory. During execution a PME parallel program in the SPMD process of execution could call
upon already coded assembly service functions from a runtime library kernel.

Since the APAP can function as either an attached unit that is closely or loosely coupled with its host or
as a stand alone processor, some variation in the upper level S/W models exists. However, these variations
serve to integrate the various types of applications so as to permit a single set of lower level functions to satisfy
all three applications. The explanation will address the attached version S/W first and then the modifications
required for standalone modes.

In any system, as illustrated by FIGURE 10, where the APAP is intended to attach to a host processor,
the user's primary program would exist within the host and would delegate to the APAP unit tasks and
associated data as needed to provide desired load balancing. The choice of interpreting the dispatched
task's program within the host or the Array Director is a user option. Host level interpretation permits the
Array Director to work at interleaving users which do not exploit close control of the Array, while APAP
interpretation leads to minimal latency in control branching but tends to limit the APAP time to perform
multi-user management tasks. This leads to the concept that the APAP and host can be tightly or loosely
coupled.

Two examples illustrate the extremes:

1. When APAP is attached to 3090 class machines with Floating Point Vector Facilities, user data in
compressed form could be stored within the APAP. A host program that called for a vector operation
upon two vectors with differing sparseness characteristics would then send instructions to the APAP to
realign the data into element by element matching pairs, output the result to the Vector Facility, read the
answer from the Vector Facility and finally reconfigure data into final sparse data form. Segments of the
APAP would be interpreting and building sparse matrix bit maps, while other sections would be
calculating how to move data between PMEs such that it would be properly aligned for the zipper.

2. With APAP attached to a small inflight military computer, the APAP could be performing the entire
workload associated with Sensor Fusion Processing. The host might initiate the process once, send
sensor data as it was received to the APAP and then wait for results. The Array Director would then have
to schedule and sequence the PME array through perhaps dozens of processing steps required to
perform the process.

The APAP will support three levels of user control:

1. Casual User. S/he works with supplied routines and library functions. These routines are maintained at
the host or API level and can be invoked by the user via subroutine calls within his or her program.
2. Customizer User. S/he can write special functions which operate within the API and which directly
invoke routines supplied with the API or services supplied with the CC or PME.
3. Development User. S/he generates programs for execution in the CC or PME, depending upon API
services for program load and status feedback.

Satisfying these three user levels in either closely or loosely coupled systems leads to the partitioning
of H/W control tasks.

API Software Tasks 

The application program interface (API) contains S/W services that can test the leading words of data
received and can determine whether that data should be interpreted by the API, loaded to some storage
within the Array Director or PME, or passed to the I/O zipper.
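
The leading-word test described above amounts to a small dispatch on a header field. The sketch below shows one way such a test could look. The tag values, the choice of the top four bits of a 16-bit word, and all names are hypothetical, since the text does not define the header format.

```python
from enum import Enum


class Route(Enum):
    INTERPRET = "interpret by API"
    LOAD = "load to Array Director or PME storage"
    ZIPPER = "pass to I/O zipper"


# Hypothetical tag values carried in the top 4 bits of the leading word.
TAG_COMMAND = 0x1
TAG_LOAD = 0x2
TAG_ZIPPER = 0x3

_ROUTES = {TAG_COMMAND: Route.INTERPRET, TAG_LOAD: Route.LOAD, TAG_ZIPPER: Route.ZIPPER}


def classify(words):
    """Decide where a received block goes by testing its leading 16-bit word."""
    if not words:
        raise ValueError("empty block")
    tag = (words[0] & 0xFFFF) >> 12
    try:
        return _ROUTES[tag]
    except KeyError:
        raise ValueError(f"unknown leading-word tag {tag:#x}") from None


if __name__ == "__main__":
    print(classify([0x1005, 0x0042]))   # Route.INTERPRET
    print(classify([0x2000]))           # Route.LOAD
    print(classify([0x3ABC, 0x0001]))   # Route.ZIPPER
```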

For data that is to be interpreted, the API determines the required operation and invokes the function.
The most common type of operation would call for the Array to perform some function which would be
executed as a result of API writes to the CS (and indirectly to the CC). The actual data written to the CS/CC
would in general be constructed by the API operational routine based upon the parameters passed to the
API from the host. Data sent to the CS/CC would in turn be forwarded to the PMEs via the node BCI.

Data could be loaded to either API storage, CC storage, or PME memory. Further, data to be loaded to
PME memory could be loaded via either the I/O zipper or via the node BCI. For data to be put into the API
memory, the incoming bus would be read then written to storage. Data targeted to the CC memory would
be similarly read and then be written to the CC memory. Finally, data for the PME memory (in this case
normally new or additional MIMD programs) could be sent to all or selected PMEs via the CS/CC/Node BCI
or to a subset of PMEs for selective redistribution via the I/O zipper.

When data is to be sent to the I/O zipper, it could be preceded by inline commands that permit the
PME MIMD programs to determine its ultimate target, or it could be preceded by calls to the API service
functions to perform either MIMD initiation or SIMD transmission.

In addition to responding to requests for service received via the host interface, the API program will
respond to requests from the PMEs. Such requests will be generated on the high speed serial port and will
be routed through the CC/CS combination. Requests of this sort can result in the API program's directly
servicing the PMEs or accessing the PMEs via the standard speed serial port to determine further qualifying
data relative to the service request.

PME Software

The software plan includes:

o Generation of PME resident service routines (that is, 'an extended ISA') for complex operations and
I/O management.
o Definition and development of controller executed subroutines that produce and pass control and
parameter data to the PMEs via the BCI bus. These subroutines:
1. cause a set of PMEs to do mathematical operations on distributed objects,
2. provide I/O data management and synchronization services for PME Array and System Bus
interactions,
3. provide startup program load, program overlay and program task management for PMEs.
o Development of data allocation support services for host level programs, and
o Development of a programming support system including assembler, simulator, and H/W monitor and
debug workstation.

Based upon studies of military sensor fusion, optimization, image transformation, US Post Office optical
character recognition and FBI fingerprint matching applications, we have concluded that a parallel processor
programmed with vector and array commands (like BLAS calls) would be effective. The underlying
programming model must match the PME array characteristics feasible with today's technology.
Specifically:

o PMEs can be independent stored program processors,
o The array can have thousands of PMEs, and be suitable for fine grained parallelism,
o Inter-PME networks will have very high aggregate bandwidth and a small 'logical diameter',
o But, by network connected microprocessor MIMD standards, each PME is memory limited.

Prior programming on MIMD parallel processors has used task dispatching methodology. Such
approaches lead to each PME needing access to a portion of a large program. This characteristic, in
combination with the non-shared memory characteristic of the H/W, would exhaust PME memory on any
significant problem. We therefore target what we believe is a new programming model, called
"asynchronous SIMD" (A-SIMD) type processing. In this connection see USSN 7g$,79S, filed November 27,
1991 of P. Kogge, which is incorporated herein.

A-SIMD programming in our APAP design means that a group of PMEs will be directed by commands
broadcast to them as in SIMD models. The broadcast command will initiate execution of a MIMD function
within each PME. That execution can involve data dependent branching and addressing within PMEs, and
I/O based synchronization with either other PMEs or the BCI.

Normally, PMEs will complete the processing and synchronize by reading the next command from the
BCI.
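
The A-SIMD pattern described above (one broadcast command, a locally resident routine with data dependent branching in each PME, then resynchronization on the next command) can be sketched as follows. The routine library, its contents, and the step-count model of resynchronization are illustrative assumptions, not the PME's actual routines.

```python
def halve_until_odd(value):
    """Data dependent loop: the step count differs from PME to PME."""
    steps = 0
    while value and value % 2 == 0:
        value //= 2
        steps += 1
    return value, steps


def negate(value):
    """Fixed-length routine: always one step."""
    return -value, 1


# Hypothetical library of routines resident in every PME.
LIBRARY = {"halve_until_odd": halve_until_odd, "negate": negate}


def broadcast(command, local_data):
    """Run the named routine on each PME's local datum.

    Returns the new local values and the number of steps before every
    PME is ready to read the next command (the slowest PME's count).
    """
    routine = LIBRARY[command]
    results = [routine(v) for v in local_data]
    values = [value for value, _ in results]
    return values, max(steps for _, steps in results)


if __name__ == "__main__":
    print(broadcast("halve_until_odd", [8, 6, 5, 12]))
    print(broadcast("negate", [1, 2, 3]))
```

Here the single `halve_until_odd` command takes between zero and three steps depending on the local data, which is the asynchronous part of the model.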

The A-SIMD approach includes both MIMD and SIMD operating modes. Since the approach imposes no
actual time limits on the command execution period, a PME operation that synchronizes on data transfers
and executes indefinitely can be initiated. Such functions are very effective in data filtering, DSP, and
systolic operations. (They can be ended by either BCI interrupts or by commands over the serial control
buses.) SIMD operation results from any A-SIMD control stream that does not include MIMD Mode
Commands. Such a control stream can include any of the PME's native instructions. These instructions are
routed directly to the instruction decode logic of the PME. Eliminating the PME instruction fetch provides a
higher performance mode for tasks that do not involve data dependent branching.

This programming model (supported by H/W features) extends to permitting the array of PMEs to be
divided into independent sections. A separate A-SIMD command stream controls each section. Our
application studies show that programs of interest divide into separate phases (i.e. input, input buffering,
several processing steps, and output formatting, etc.), suitable for pipeline data processing. Fine-grained
parallelism results from applying the n PMEs in a section to a program phase. Applying coarse-grained
partitioning to applications often results in discovering small repetitive tasks suitable for MIMD or memory
bandwidth limited tasks suitable for SIMD processing. We program the MIMD portions using conventional
techniques and program the remaining phases as A-SIMD sections, coded with vectorized commands,
sequenced by the array controller. This makes the large controller memory the program store. Varying the
number of PMEs per section permits balancing the workload. Varying the dispatched task size permits
balancing the BCI bus bandwidth to the control requirement.

The programming model also considers allocating data elements to PMEs. The approach is to distribute
data elements evenly over PMEs. In early versions of S/W, this will be done by the programmer or by S/W.
We recognize that the IBM parallelizing compiler technologies apply to this problem and we expect to
investigate their usage. However, the inter-PME bandwidth provided does tend to reduce the importance of
this approach. This links data allocation and I/O mechanism performance.

The H/W requires that the PME initiate data transfers out of its memory, and it supports a controlled
write into PME memory without PME program involvement. Input control occurs in the receiving PME by
providing an input buffer address and a maximum length. When I/O to a PME results in buffer overflow,
H/W will interrupt the receiving PME. The low level I/O functions that will be developed for PMEs build on
this service. We will support either movement of raw data between adjacent PMEs or movement of
addressed data between any PMEs. The last capability depends upon the circuit switched and store and
forward mechanisms. The interpret address and forward operation is important for performance. We have
optimized the H/W and S/W to support the operation. Using one word buffers results in an interrupt upon
receipt of the address header. Comparing target ID with local ID permits output path selection. Transfer of the
subsequent data words occurs in circuit switched mode. A slight variation on this process using larger
buffers results in a store and forward mechanism.
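
The interpret-address-and-forward step can be sketched as a comparison of the header's target ID with the local ID, followed by a choice of output port. The example below assumes, purely for illustration, that PMEs sit on a single bidirectional ring and that the shorter direction is preferred. The real network is a modified hypercube, and its path selection rule is not given here.

```python
def select_output(local_id, target_id, ring_size=8):
    """Pick an output port from the address header.

    Returns "local" when the header is addressed to this PME, otherwise
    "+" or "-" for the (assumed) shorter direction around the ring.
    """
    if target_id == local_id:
        return "local"
    forward = (target_id - local_id) % ring_size
    return "+" if forward <= ring_size - forward else "-"


def route(src, dst, ring_size=8):
    """Follow the header hop by hop and return the list of PMEs visited."""
    path = [src]
    current = src
    while True:
        port = select_output(current, dst, ring_size)
        if port == "local":
            return path
        current = (current + (1 if port == "+" else -1)) % ring_size
        path.append(current)


if __name__ == "__main__":
    print(route(1, 3))   # short way forward
    print(route(1, 6))   # short way backward, wrapping past 0
```

Once a port is selected for the one-word header, the remaining data words would follow the same path without further inspection, which corresponds to the circuit switched mode described above.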

Because of the high performance inter-PME bandwidth, it is not always necessary or desirable to place
data elements within the PME Array carefully. Consider shifting a vector data element distributed across
PMEs. Our architecture can send data without an address header, thus providing for very fast I/O. However,
we have found, in many applications, that optimizing a data structure for movement in one direction
penalizes data movement in an orthogonal direction. The penalty in such situations approximates the
average cost of randomly routing data in the network. This leads to applications where placing data
sequentially or randomly (as opposed to arranging data) results in shorter average process times.

Many applications can be synchronized to take advantage of average access time. (For example, PDE
relaxation processes acquire data from a neighborhood and thus can average access over at least four I/O
operations.) We believe that after considering the factors applicable to vector and array processes, like
scatter/gather or row/column arithmetic, many users will find brute force data allocation to be suitable for the
application. However, we know of examples that illustrate application characteristics (like required
synchronization or biased utilization of shift directions¹) that tend to force particular data allocation patterns.
This characteristic requires that the tools and techniques developed support either manual tuning of the data
placement, or simple and non-optimum data allocation. (We will support the non-optimum data allocation
strategy with host level macros to provide near transparent port of vectorized host programs to the MPP.
The H/W Monitor workstation will permit the user to investigate the resultant performance.)

FIGURE 10 shows the general S/W development and usage environment. The Host Application
Processor is optional in that program execution can be controlled from either the Host or the Monitor.
Further, the Monitor will effectively replace the Array Controller in some situations. The environment will
support program execution on real or simulated MPP hardware. The Monitor is scenario driven so that the
developer doing test and debug operations can create procedures to permit effective operation at any level
of abstraction.

FIGURE 20 illustrates the levels of H/W supported within the MPP and the user interfaces to these
levels.

We see two potential application programming techniques for the MPP. In the least programmer
intensive approach, applications would be written in a vectorized high order language. If the user did not
feel that the problem warranted tuning data placement, then he would use compile time services to allocate
data to the PME Array. The application would use vector calls like BLAS that would be passed to the
controller for interpretation and execution on the PME Array. Unique calls would be used to move data
between host and PME Array. In summary, the user would not need to be aware of how the MPP organized
or processed the data. Two optimization techniques will be supported for this type of application:

1. Altering the data allocation by constructing a data allocation table will permit programs to force data
placements.

2. Generation of additional vector commands for execution by the array controller will permit tuned
subfunctions (i.e. calling the Gaussian Elimination as a single operation).

We also see that the processor can be applied to specialized applications as in those referenced in the
beginning of this section. In such cases, code tuned to the application would be used. However, even in
such applications the degree of tuning will depend upon how important a particular task is to the application.
It is in this situation that we see the need for tasks individually suited to SIMD, MIMD or A-SIMD modes.
These programs will use a combination of:

1. Sequences of PME native instructions passed to an emulator function within the array controller. The
emulator will broadcast the instruction and its parameters to the PME set. The PMEs in this SIMD mode
will pass the instruction to the decode function, simulating a memory fetch operation.

2. Tight inner loops that can be I/O synchronized will use PME native ISA programs. After initiation from
a SIMD mode change, they would run continuously in MIMD mode. (The option to return to SIMD mode
via a 'RETURN' instruction exists.)

3. More complicated programs, as would be written in a vectorizing command set, would execute
subroutines in the array controller that invoked PME native functions. For example, a simplified array
controller program to do a BLAS 'SAXPY' command on vectors loaded sequentially across PMEs would
start sequences within the PMEs that:

a. Enable PMEs with required x elements via comparison of PME ID with broadcast 'incx' and

b. Compress the x values via a write to consecutive PMEs,

c. Calculate the address of PMEs with y elements from broadcast data,

d. Transmit the compressed x data to the y PMEs,

e. Do a single precision floating point operation in PMEs receiving x values to complete the operation.

Finally, the SAXPY example illustrates one additional aspect of executing vectorized application
programs. The major steps execute in the API and could be programmed by either an optimizer or product
developer. Normally, the vectorized application would call rather than include this level of code. These steps
would be written as C or Fortran code and will use memory mapped reads or writes to control the PME array
via the BCI bus. Such a program operates the PME array as a series of MIMD steps synchronized by
returns to the API program. Minor steps such as the single precision floating point routines would be
developed by the Customizer or Product Developer. These operations will be coded using the native PME
ISA and will be tuned to the machine characteristics. In general, this would be the domain of the Product
Developer since coding, test and optimization at this level require usage of the complete product
development tool set.

1 Gaussian Elimination with normal pivoting requires shifting rows but not columns. More than £1 performance
difference would result from arranging the data such that columns were on the fast shift direction. Even with
that there is not an advantage to arranging rows in any particular relationship to the buses.
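
The SAXPY sequence (steps a to e) can be mimicked in a few lines to show the data movement involved. This is a host-side sketch under stated assumptions: one vector element per PME, list indices standing in for PME IDs, and the parameters `x_addr`, `y_addr` and `incy` invented for illustration (only 'incx' appears in the text). It does not model the BCI or the native PME ISA.

```python
def saxpy_across_pmes(n, a, x_mem, x_addr, incx, y_mem, y_addr, incy):
    """Compute y := a*x + y for n elements spread one per PME.

    x_mem / y_mem hold the local value of each PME, indexed by PME ID.
    """
    # (a) Enable PMEs whose ID falls on the broadcast x stride.
    enabled = [p for p in range(x_addr, len(x_mem)) if (p - x_addr) % incx == 0][:n]
    # (b) Compress the selected x values into consecutive slots.
    packed = [x_mem[p] for p in enabled]
    # (c) Calculate which PMEs hold the y elements.
    y_pmes = [y_addr + i * incy for i in range(n)]
    # (d) Transmit packed x to the y PMEs, (e) finish with a multiply-add.
    result = list(y_mem)
    for x_value, p in zip(packed, y_pmes):
        result[p] = a * x_value + result[p]
    return result


if __name__ == "__main__":
    x_mem = [1.0, 0.0, 2.0, 0.0, 3.0, 0.0]   # x stored with stride 2
    y_mem = [10.0, 20.0, 30.0]               # y stored contiguously
    print(saxpy_across_pmes(3, 2.0, x_mem, 0, 2, y_mem, 0, 1))
```

In the machine itself each lettered step would be a MIMD sequence started by the controller, with control returning to the API program between steps.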

The APAP can have applications written in sequential Fortran. The path is quite different. FIGURE 21
outlines a Fortran compiler which can be used. In the first step, it uses a portion of the existing parallelizing
compiler to develop program dependencies. The source plus these tables become an input to a process
that uses a characterization of the APAP MMP and the source to enhance parallelism.

This MMP is a non-shared memory machine and as such allocates data between the PMEs for local
and global memory. The very fast data transfer times and the high network bandwidth reduce the time
effect of data allocation, but it still is addressed. Our approach treats part of memory as global and uses a
S/W service function. It is also possible to use the dependency information to perform the data allocation in
a second alternative. The final step in converting the source to multiple sequential programs is performed
by the Level Partitioning step. This partitioning step is analogous to the Fortran 3 work being
conducted with DARPA funding. The last process in the compilation is generation of the executable code at
all individual functional levels. For the PME this will be done by programming the code generator on an
existing compiler system. The Host and API code compilers generate the code targeted to those machines.

The PME can execute MIMD software from its own memory. In general, the multiple PMEs would not
be executing totally different programs but rather would be executing the same small program in an
asynchronous manner. Three basic types of S/W can be considered, although the design approach does not
limit the APAP to just these approaches:
1. Specialized emulation functions would make the PME Array emulate the set of services provided by
standard user libraries like LINPACK or VPSS. In such an emulation package, the PME Array could be
using its multiple set of devices to perform one of the operations required in a normal vector call. This
type of emulation, when attached to a vector processing unit, could utilize the vector unit for some
operations while performing others internally.
2. The parallelism of the PME Array could be exploited by operating a set of software that provides a
new set of mathematical and service functions in the PMEs. This set of primitives would be the codes
exploited by a customizing user to construct his application. The prior example of performing sensor
fusion on an APAP attached to a military platform would use such an approach. The customizer would
write routines to perform Kalman Filters, Track Optimum Assignment and Threat Assessment using the
supplied set of function names. This application would be a series of API call statements, and each call
would result in initiating the PME set to perform some basic operation like 'matrix multiply' on data
stored within the PME Array.

3. In cases where no effective method exists, considering performance objectives or application needs,
custom S/W could be developed and executed within the PME. A specific example is "Sort". Many
methods to sort data exist and the objective in all cases is to tune the process and the program to the
machine architecture. The modified hypercube is well suited to a Batcher Sort; however, that sort
requires extensive calculations to determine particular elements to compare versus very short
comparison cycles. The computer program in FIGURE 17 shows a simple example of a PME program 1100 to
perform the Batcher Sort 1000 with one element per PME. Each line of the program description would be
expanded to 3 to 6 PME machine level instructions, and all PMEs would then execute the program in
MIMD mode. Program synchronization is managed via the I/O statements. The program extends to more
data elements per PME and to very large parallel sorts in a quite straightforward manner.
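
As an illustration of the one-element-per-PME sort mentioned in item 3, the sketch below simulates Batcher's bitonic sorting network, in which every compare-exchange pairs a PME with the partner whose index differs in a single bit, that is, a hypercube neighbour. This is an assumption-laden stand-in: the program of FIGURE 17 is not reproduced here, the function name `bitonic_sort_pmes` is hypothetical, and the loop over PMEs is sequential where the real array would act in parallel.

```python
def bitonic_sort_pmes(values):
    """Sort one value per simulated PME; the PME count must be a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("number of PMEs must be a power of two")
    data = list(values)
    k = 2
    while k <= n:                 # length of the bitonic sequences being merged
        j = k // 2
        while j >= 1:             # index distance to the partner in this step
            for pme in range(n):
                partner = pme ^ j             # neighbour along one hypercube dimension
                if partner > pme:             # handle each pair once
                    ascending = (pme & k) == 0
                    if (data[pme] > data[partner]) == ascending:
                        data[pme], data[partner] = data[partner], data[pme]
            j //= 2
        k *= 2
    return data
```

Each pass of the inner `for` loop corresponds to one synchronized exchange step; the index arithmetic that picks the partner is the "extensive calculation" relative to the single comparison that follows it.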
CC Storage Contents

Data from the CC storage is used by the PME Array in one of two manners. When the PMEs are
operating in SIMD, a series of instructions can be fetched by the CC and passed to the node BCI, thus
reducing load on both the API and OS. Alternatively, functions that are not frequently required, such as PME
Fault Reconfiguration S/W, PME Diagnostics, and perhaps conversion routines can be stored in the CC
memory. Such functions can then be requested by operating PME MIMD programs or moved to the PMEs
at the request of API program directives.

Packaging of the 8-Way Modified Hypercube

Our packaging techniques take advantage of the eight PMEs packaged in a single chip and arranged in
an N-dimensional modified hypercube configuration. This chip level package or node of the array is the
smallest building block in the APAP design. These nodes are then packaged in an 8 x 8 array where the
+-X and the +-Y make rings within the array or cluster and the +-W and +-Z are brought out to the
neighboring clusters. A grouping of clusters makes up an array. This step significantly cuts down wire count
for data and control for the array. The W and Z buses will connect to the adjacent clusters and form W and
Z rings to provide total connectivity around the completed array of various sizes. The massively parallel
system will be comprised of these cluster building blocks to form the massive array of PMEs. The APAP
will consist of an 8 x 8 array of clusters; each cluster will have its own controller and all the controllers will
be synchronized by our Array Director.

Many trade-offs of wireability and topology have been considered, yet with these considerations we
prefer the configuration which we illustrate with this connection. The concept disclosed has the advantage of
keeping the X and Y dimensions within a cluster level of packaging, and distributing the W and Z bus
connections to all the neighboring clusters.

After implementing the techniques described, the product will be wireable and manufacturable while
maintaining the inherent characteristics of the topology defined.

The concept used here is to mix, match, and modify topologies at different packaging levels to obtain
the desired results in terms of wire count. For the method to define the actual degree of modification of the
hypercube, refer to the Rolfe modified hypercube patent application referenced above. For the purpose of
this preferred embodiment we will describe two packaging levels to simplify our description. It can be
expanded.

The first is the chip design or chip package illustrated by FIGURE 3 and FIGURE 11. There are eight of
the processing elements with their associated memory and communication logic encompassed into a single
chip which is defined as a node. The internal configuration is classified as a binary hypercube or a 2-degree
hypercube where every PME is connected to two neighbors. See the PME-PME communication diagram in
FIGURE 9, especially 500, 510, 520, 530, 540, 550, 560, 570.

The second step is that the nodes are configured as an 8 x 8 array to make up a cluster. The fully
populated machine is built up of an array of 8 x 8 clusters to provide the maximum capacity of 32768
PMEs. These 4096 nodes are connected in an 8 degree modified hypercube network where the
communication between nodes is programmable. This ability to program different routing paths adds flexibility to
transmit different length messages. In addition to message length differences, there are algorithm
optimizations that can be addressed with these programmability features.
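
The node and PME counts follow directly from the packaging hierarchy, and the arithmetic can be checked with a few lines. This is a sketch for verification only: the constant names are hypothetical, and the 5 MIPS per PME figure is the one quoted in the abstract of this document.

```python
PMES_PER_NODE = 8            # eight PMEs on one chip (node)
NODES_PER_CLUSTER = 8 * 8    # 8 x 8 array of nodes
CLUSTERS = 8 * 8             # 8 x 8 array of clusters
MIPS_PER_PME = 5             # per-PME figure quoted in the abstract

def machine_size():
    """Return the sizes implied by the two packaging levels."""
    nodes = NODES_PER_CLUSTER * CLUSTERS
    pmes = nodes * PMES_PER_NODE
    return {
        "pmes_per_cluster": NODES_PER_CLUSTER * PMES_PER_NODE,
        "nodes": nodes,
        "pmes": pmes,
        "gips": pmes * MIPS_PER_PME / 1000,
    }
```

The result is 512 PMEs per cluster, 4096 nodes and 32768 PMEs, and roughly 164 GIPS of aggregate integer throughput for the fully populated machine.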

The packaging concept is intended to significantly reduce the off page wire count for each of the
clusters. This concept takes a cluster which is defined as an 8 x 8 array of nodes 820, each node 825 having
8 processing elements for a total of 512 PMEs, then limits the X and Y rings within the cluster and, finally,
brings out the W and Z buses to all clusters. The physical picture could be envisioned as a sphere
configuration 800, 810 of 64 smaller spheres 830. See FIGURE 15 for a future packaging picture which
illustrates the full up packaging technique, limiting the X and Y rings 800 within the cluster and extending
out the W and Z buses to all clusters 810.

The actual connection of a single node to the adjacent X and Y neighbors 975 exists within the same
cluster. The wiring savings occurs when the Z and W buses are extended to the adjacent neighboring
clusters as illustrated in FIGURE 16. Also illustrated in FIGURE 16 is the set of the chips or nodes that can
be configured as a sparsely connected 4-dimensional hypercube or torus 900, 905, 910, 915. Consider each
of the 8 external ports to be labeled as +X, +Y, +Z, +W, -X, -Y, -Z, -W 950, 975. Then, using m chips, a
ring can be constructed by connecting the +X to -X ports. Again, m such rings can be interconnected into a
ring of rings by interconnecting the matching +Y to -Y ports. This level of structure will be called a cluster.
It provides for 512 PMEs and will be the building block for several size systems. Two such connections
(950, 975) are shown in the diagram illustrated in FIGURE 16.
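
The ring-of-rings construction can be made concrete with a small wiring table. The sketch below is illustrative only: node coordinates `(x, y)`, the function name `cluster_links` and the dictionary layout are assumptions, m = 8 is taken from the 8 x 8 cluster, and the W and Z ports that leave the cluster are not modelled.

```python
def cluster_links(m=8):
    """Map each node (x, y) of an m x m cluster to its X and Y ring neighbours."""
    links = {}
    for x in range(m):
        for y in range(m):
            links[(x, y)] = {
                "+X": ((x + 1) % m, y),   # +X port wired to the -X port of the next chip
                "-X": ((x - 1) % m, y),
                "+Y": (x, (y + 1) % m),   # +Y ports join the m rings into a ring of rings
                "-Y": (x, (y - 1) % m),
            }
    return links
```

With m = 8 the table holds 64 nodes, and following the +X port eight times from any node returns to that node, which is the ring closure the text describes.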

Applications for Deskside MPP

The deskside MPP in a workstation can be effectively applied in several application areas including:
1. Small production tasks that depend upon compute intensive processes. The US Postal Service
requires a processor that can accept a fax image of a machine printed envelope and then find and read
the zip code. The process is needed at all regional sort facilities and is an example of a very repetitive
but still compute intensive process. We have implemented APL language versions of a sample of the
required programs. These models emulate the vector and array processes that will be used to do the
work on the MPP. Based upon this test, we know that the task is an excellent match to the processing
architecture.

2. Tasks in which an analyst, as a result of prior output or expected needs, requests sequences of data
transformations. In an example taken from the Defense Mapping Agency, satellite images are to be
transformed and smoothed pixel by pixel into some other coordinate system. In such a situation, the
transformation parameters for the image vary across localities as a result of ground elevation and slope.
The analyst must therefore add fixed control points and reprocess transformations. A similar need occurs
in the utilization of scientific simulation results when users require almost real time rotation or
perspective changes.

3. Program development for production versions of the MPP will use workstation size MPPs. Consider a
tuning process that requires analysis of processor versus network performance. Such a task is machine
and analyst interactive. It can require hours when the machine is idle and the analyst is working. When
performed on a supercomputer it would be very costly. However, providing an affordable workstation
MPP with the same (but scaled) characteristics as the supercomputer MPP eliminates costs and eases
the test and debug process by eliminating the programmer inefficiencies related to accessing remote
processors.

FIGURE 22 is a drawing of the workstation accelerator. It uses the same size enclosure as the
RISC/6000 model 530. Two swing out gates, each containing a full cluster, are shown. The combined two
clusters provide 5 GOPS of fixed point performance and 530 Mflops of processing power and about 100
Mbyte/s of I/O bandwidth to the array. The unit would be suitable for any of the prior applications. With
quantity production and including a host RISC/6000, it would be price comparable with high performance
workstations, not at the price of comparable machines employing old technology.

Description of the AWACS Sensor Fusion

The military environment provides a series of examples showing the need for a hardened compute
intensive processor.

Communication in the targeted noisy environments implies the need for digitally encoded
communications, as is used in ICNIA systems. The process of encoding the data for transmission and recovering
information after receipt is a compute intensive process. The task can be done with specialized signal
processing modules, but for situations where communication encoding represents bursts of activity,
specialized modules are mostly idle. Using the MPP permits several such tasks to be allocated to a single
module and saves weight, power, volume and cost.

Sensor data fusion presents a particularly clear example of enhancing an existing platform with the
compute power gained from the addition of MPP. On the Air Force E3 AWACS there are more than four
sensors on the platform, but there is currently no way to generate tracks resulting from the integration of all
available data. Further, the existing generated tracks have quite poor quality due to sampling characteristics.
Therefore, there is motivation to use fusion to provide an effective higher sample rate.

We have studied this sensor fusion problem in detail and can propose a verifiable and effective solution,
but that solution would overwhelm the compute power available in an AWACS data processor. FIGURE 23
shows the traditional track fusion process. The process is faulty because each of the individual processes
tends to make some errors and the final merge tends to collect them instead of eliminating them. The
process is also characterized by high time latency in that merging does not complete until the slowest
sensor completes. FIGURE 24 presents an improvement and the resulting compute intensive problem with
the approach. Although we cannot solve an NP-Hard problem, we have developed a good method to
approximate the solution. The details of that application are being described by the inventors
elsewhere, as it can be employed on a variety of machines like an Intel Touchstone with 512 i860 (80860)
processors, and IBM's Scientific Visualization System. It can also be used as an application suitable for the MMP
using the APAP design described here with, say, 128,000 PMEs, substantially outperforming these other
systems. Application experiments show the approximation quality is below the level of sensor noise and as
such the answer is applicable to applications like AWACS. FIGURE 25 shows the processing loop involved
in the proposed Lagrangean Reduction n-dimensional Assignment algorithm. The problem uses very
controlled repetitions of the well known 2-dimensional assignment problem, the same algorithm that
classical sensor fusion processing uses.

Suppose for example that the n-dimensional algorithm was to be applied to the seven sets of
observations illustrated in FIGURE 24 and further, suppose that each pass through a reduction process
required four iterations through a 2d Assignment process. Then the new nd Assignment Problem would
require 4000 iterations of the 2d Assignment Problem. The AWACS workload is now about 90% of machine
capacity. Fusion perhaps requires 10% of the total effort, but even that small effort when scaled up 4000
times results in total utilization being 370 times the capacity of an AWACS. Not only does this workload
overwhelm the existing processor, but it would be marginal in any new MIL environment suited,
coarse-grained, parallel processing system currently existing or anticipated in the next few years. If the algorithm
required an average of 5 rather than 4 iterations per step, then it would overwhelm even the hypothesized
systems. Conversely, the MPP solution can provide the compute power and can do so even at the 6
iteration level.
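
The workload estimate above can be reproduced approximately with a short calculation. The sketch rests on an assumption that is not stated in the text: that seven observation sets imply six nested reduction levels, so that four iterations per level gives 4^6 = 4096, which is close to the quoted figure of about 4000. The function name `fusion_load` and its parameters are hypothetical.

```python
def fusion_load(sets=7, iterations_per_pass=4, base_load=0.90, fusion_share=0.10):
    """Estimate 2d-assignment calls and total load relative to AWACS capacity.

    Assumption: the number of nested reduction levels is sets - 1.
    """
    levels = sets - 1
    two_d_calls = iterations_per_pass ** levels
    other_work = base_load * (1 - fusion_share)          # non-fusion part is unchanged
    fusion_work = base_load * fusion_share * two_d_calls  # fusion part scales with the calls
    return two_d_calls, other_work + fusion_work
```

With the default values this gives 4096 calls and a load of about 369 times capacity, in line with the figure of 370 quoted above; raising `iterations_per_pass` to 5 gives 15625 calls, which shows how quickly the requirement grows.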

Mechanical Packaging

As illustrated in FIGURE 3 and other FIGURES, our preferred chip is configured in a quadflatpack form.
As such it can be brickwalled into various 2-D and 3-D configurations in a package. One chip of eight or
more processor memory elements is a first level package module, the same as a single DRAM memory
chip is to a foundry which packages the chip. However, it is in a quadflatpack form, allowing connections to
one another in four directions. Each connection is point to point. (One chip in its first level package is a
module to the foundry.) We are able to construct PE arrays of sufficient magnitude to hit our performance
goals due to this feature. The result is that you can connect these chips across 3, 4 or even five feet,
point-to-point, i.e. multi-processor node to node, and still have proper control without the need of fiber optics.

This has an advantage for the drive/receive circuits that are required on the modules. One can achieve
high performance and keep the power dissipation down because we do not have bus systems that daisy
chain from module to module. We broadcast from node to node, but this need not be a high performance
path. Most data operations can be conducted in a node, so data path requirements are reduced. Our
broadcast path is primarily used as a controller routing tool. The data stream attaches to and
runs in the ZWXY communication path system.

Our power dissipation is 22 watts per node module for our commercial workstation. This allows us to
use air cooled packaging. The power system requirements for our system are also reasonable because of
this fact. Our power system illustrated multiplies the number of modules supported by about 23 watts per
module, and such a five volt power supply is very cost effective. Those concerned with the amount of
electricity consumed would be astonished that 32 microcomputers could operate with less than the wattage
consumed by a reading light.

Our thermal design is enhanced because of the packaging. We avoid hot spots due to high dissipating
parts mixed with low dissipating ones. This reflects directly on the cost of the assemblies.

The cost of our system is very attractive compared to the approaches that put a superscalar processor
on a card. Our performance level per assembly per watt per connector per part type per dollar is excellent.
Furthermore, we do not need the same number of packaging levels that the other technology does. We
do not need module/card/backplane and cable. We can skip the card level if we want to. As illustrated in our
workstation modules, we have skipped the card level with our brickwalled approach.

Furthermore, as we illustrated in our layout, each node housing which is brickwalled in the workstation
modules can, as illustrated in FIGURE 3, comprise multiple replicated dies, even within the same chip
housing. While normally we would place one die within an air cooled package, it is possible to place 8 dies
on a substrate using a multiple chip module approach. Thus, the envisioned watch with 32 or more
processors is possible, as well as many other applications. The packaging and power and flexibility make
applications which are endless. A house could have its controllable instruments all watched, and
coordinated with a very small part. Those many chips that are spread around an automobile for engine watching,
brake adjustment and so on could all have a monitor within a housing. In addition, on the same substrate
with hybrid technology, one could mount a 386 microprocessor chip with full programmable capability and
memory (all in one chip) and use it as the array controller for the substrate package.

We have shown many configurations of systems, from control systems, FIGURE 3, to larger and larger
systems. The ability to package a chip with multiple processor memory elements of eight or more on a chip
in a DIP with pinouts fitting in a standard DRAM memory module, such as in a SIMM module, makes possible
countless additional applications ranging from controls to wall size video displays which can have a
repetition rate, not at the 13 or so frames that press the existing technology today, but at 30 frames, with a
processor assigned to monitor a pixel, or a node only a few pixels. Our brickwall quadflatpack makes it easy
to replicate the same part over and over again. Furthermore, the replicated processor is really memory
with processor interchange. Part of the memory can be assigned to a specific monitoring task, and another
part (with a size programmatically defined) can be a massive global memory, addressed point-to-point, with
broadcast to all capability.

Our basic workstation, our supercomputer, our controller, our AWACS, all are examples of packages
that can employ our new technology. An array of memory, with inbuilt CPU chips and I/O, functions as a
PME of massively parallel applications, and even more limited applications. The flexibility of packaging and
programming makes imaginations expand and our technology allows one part to be assigned to many ideas
and images.

Military Avionics Applications

The cost advantage of constructing a MIL MPP is particularly well illustrated by the AWACS. It is a 20
year old enclosure that has grown empty space as new technology memory modules have replaced the
original core memories. FIGURE 26 shows a MIL qualifiable two cluster system that would fit directly into
the rack's empty space and would use the existing memory bus system for interconnection.

Although the AWACS example is very advantageous due to the existence of empty space, in other
systems it is possible to create space. Replacing existing memory with a small MPP or gateway to an
isolated MPP is normally quite viable. In such cases, a quarter cluster and an adapter module would result in
an 8 Megabyte memory plus 640 MIPS and use perhaps two slots.

Supercomputer Application

A 64 cluster MPP is a 13.6 Gflop supercomputer. It can be configured in a system described in FIGURE
27.

The system we describe allows node chips to be brickwalled on cluster cards as illustrated in FIGURE 27
to build up systems with some significant cost and size advantages. There is no need to include extra chips
such as a network switch in such a system because it would increase costs.

Our interconnection system with "brickwalled" chips allows systems to be built the way massive DRAM
memory is packaged and will have a defined bus adapter conforming to the rigid bus specifications, for
instance a MicroChannel bus adapter. Each system will have a smaller power supply system and cooling
design than other systems based upon many modern microprocessors.

Unlike most supercomputers, our current preferred APAP with floating point emulation is much faster in
integer arithmetic (164 GIPS) than it is when doing floating point arithmetic. As such, the processor would
be most effective when used in applications that are very character or integer intensive. We have
considered three program challenges which, in addition to the other applications discussed herein, are
needful of solution. The applications which may be more important than some of the "grand challenges" to
day to day life include:

1. 3090 Vector Processors contain a very high performance floating point arithmetic unit. That unit, as do
most vectorized floating point units, requires pipeline operations on dense vectors. Applications that
make extensive use of non-regular sparse matrices (i.e. matrices described by bit maps rather than
diagonals) waste the performance capability of the floating point unit. The MPP solves this problem by
providing the storage for the data and using its compute power and network bandwidth, not to do the
calculation but rather to construct dense vectors, and to decompress dense results. The Vector
Processing Unit is kept busy by a continual flow of operations on dense vectors being supplied to it by
the MPP. By sizing the MPP so that it can effectively compress and decompress at the same rate the
Vector Facility processes, one could keep both units fully busy.

2. Another host attached system we considered is a solution to the FBI fingerprint matching problem.
Here, a machine with more than 64 clusters was considered. The problem was to match about 6000
fingerprints per hour against the entire database of fingerprint history. Using massive DASD and the full
bandwidth of the MPP to host attachment, one can roll the complete data base across the incoming
prints in about 20 minutes. Operating about 75% of the MPP in a SIMD mode coarse matching
operation balances processing to required throughput rate. We estimate that 15% of the machine in
A-SIMD processing mode would then complete the matching by doing the detailed verification of unknown
print versus file print for cases passing the coarse filter operation. The remaining portions of the machine
were in MIMD mode and allocated to reserve capacity, work queue management and output formatting.
3. Application of the MPP to database operations has been considered. Although the work is very
preliminary, it does seem to be a good match. Two aspects of the MPP support this premise:

a. The connection between a cluster Controller and the Application Processor Interface is a
MicroChannel. As such, it could be populated with DASD dedicated to the cluster and accessed directly
from the cluster. A 64 cluster system with six 640 Mbyte hard drives attached per cluster would
provide 246 Gbyte storage. Further, that entire database could be searched sequentially in 10 to 20
seconds.

b. Databases are generally not searched sequentially. Instead they use many levels of pointers.
Indexing of databases can be done within the cluster. Each bank of DASD would be supported by 2.5
GIPS of processing power and 32 Mbyte of storage. That is sufficient for both searching and storing
the indices. Since indices are now frequently stored within the DASD, significant performance gains
would occur. Using such an approach and dispersing DASD on SCSI interfaces attached to the cluster
MicroChannel permits effectively unlimited size data bases.
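
The sparse-matrix role described in item 1 of this list, where the MPP builds dense vectors for the vector unit and scatters the dense results back, can be sketched as a bitmap-driven compress and decompress. This is a minimal illustration under stated assumptions: the function names `compress` and `decompress`, the list representation and the fill value are hypothetical, and the doubling step merely stands in for whatever dense operation the vector unit would perform.

```python
def compress(values, bitmap):
    """Gather the elements flagged in the bitmap into a dense vector."""
    return [v for v, keep in zip(values, bitmap) if keep]

def decompress(dense, bitmap, fill=0.0):
    """Scatter a dense result back to the positions flagged in the bitmap."""
    it = iter(dense)
    return [next(it) if keep else fill for keep in bitmap]

# One sparse row described by a bit map rather than by diagonals.
bitmap = [1, 0, 0, 1, 1, 0]
row = [4.0, 0.0, 0.0, 2.0, 5.0, 0.0]
dense = compress(row, bitmap)            # handed to the vector unit
scaled = [2.0 * v for v in dense]        # stand-in for the dense vector operation
restored = decompress(scaled, bitmap)    # scattered back into sparse layout
```

Only the flagged elements travel to the vector unit, so its pipeline sees a dense operand regardless of how irregular the bit map is.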
FIGURE 27 is an illustration of the APAP when used to build the system into a supercomputer scaled
MPP. The approach reverts to replicating units, but here it is enclosures containing 16 clusters that are
replicated. The particular advantage of this replication approach is that the system can be scaled to suit the
user's needs.

System Architecture 

An advantage of the system architecture which is employed in the current preferred embodiment is the
ISA system which will be understood by many who will form a pool for programming the APAP. The PME
ISA consists of the following Data and Instruction Formats, illustrated in the Tables.

Data Formats

The basic (operand) size is the 16 bit word. In PME storage, operands are located on integral word
boundaries. In addition to the word operand size, other operand sizes are available in multiples of 16 bits to
support additional functions.

Within any of the operand lengths, the bit positions of the operand are consecutively numbered from
left to right starting with the number 0. References to high-order or most-significant bits always refer to the
left-most bit positions. References to the low-order or least-significant bits always refer to the rightmost bit
positions.

Instruction Formats 


The length of an instruction format may either be 16 bits or 32 bits. In PME storage, instructions must
be located on a 16 bit boundary.

The following general instruction formats are used. Normally, the first four bits of an instruction define
the operation code and are referred to as the OP bits. In some cases, additional bits are required to extend
the definition of the operation or to define unique conditions which apply to the instruction. These bits are
referred to as OPX bits.
Format Code    Operation

RR             Register to Register
DA             Direct Address
RS             Register Storage
RI             Register Immediate
SS             Storage to Storage
SPC            Special

All formats have one field in common. This field and its interpretation is:

Bits 0-3     Operation Code - This field, sometimes in conjunction with an operation code extension
             field, defines the operation to be performed.

Detailed figures of the individual formats along with interpretations of their fields are provided in the
following subsections. For some instructions, two formats may be combined to form variations on the
instruction. These primarily involve the addressing mode for the instruction. As an example, a storage to
storage instruction may have a form which involves direct addressing or register addressing.

RR Format

The Register-Register (RR) format provides two general register addresses and is 16 bits in length as
shown.

    +------+------+---------+------+
    |  OP  |  RA  | 0 0 0 0 |  RB  |
    +------+------+---------+------+
     0    3 4    7 8      11 12  15

In addition to an Operation Code field, the RR format contains:
Bits 4-7     Register Address 1 - The RA field is used to specify which of the 16 general registers is
             to be used as an operand and/or destination.
Bits 8-11    Zeros - Bit 8 being a zero defines the format to be a RR or DA format and bits 9-11
             equal to zero define the operation to be a register to register operation (a special case of
             the Direct Address format).
Bits 12-15   Register Address 2 - The RB field is used to specify which of the 16 general registers is
             to be used as an operand.
The Direct Address (DA) format provides one general register address and one direct storage address
as shown.
    +------+------+---+------------+
    |  OP  |  RA  | 0 |  DIR ADDR  |
    +------+------+---+------------+
     0    3 4    7  8  9         15

In addition to an Operation Code field, the DA format contains:

Bits 4-7     Register Address 1 - The RA field is used to specify which of the 16 general registers is
             to be used as an operand and/or destination.
Bit 8        Zero - This bit being zero defines the operation to be a direct address operation or a
             register to register operation.
Bits 9-15    Direct Storage Address - The Direct Storage Address field is used as an address into
             the level unique storage block or the common storage block. Bits 9-11 of the direct
             address field must be non-zero to define the direct address form.


RS Format 



The Register Storage (RS) format provides one general register address and an indirect storage
address.

    +------+------+---+-------+------+
    |  OP  |  RA  | 1 |  DEL  |  RB  |
    +------+------+---+-------+------+
     0    3 4    7  8  9    11 12  15

In addition to an Operation Code field, the RS format contains:

Bits 4-7     Register Address 1 - The RA field is used to specify which of the 16 general registers is
             to be used as an operand and/or destination.
Bit 8        One - This bit being one defines the operation to be a register storage operation.
Bits 9-11    Register Data - These bits are considered a signed value which is used to modify the
             contents of the register specified by the RB field.
Bits 12-15   Register Address 2 - The RB field is used to specify which of the 16 general registers is
             to be used as a storage address for an operand.
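
The field descriptions for the three 16 bit formats described so far (RR, DA and RS) can be turned into a small decoder. The sketch below is illustrative, not the PME hardware decode: the function name `decode16` and the dictionary keys are hypothetical, bit 0 is taken as the most significant bit as the data format section states, and the DEL field is interpreted as 3 bit two's complement, which is an assumption since the text says only that it is a signed value.

```python
def decode16(word):
    """Classify a 16-bit instruction word as RR, DA or RS and split its fields."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("expected a 16-bit value")
    op = (word >> 12) & 0xF       # bits 0-3 (bit 0 is the left-most bit)
    ra = (word >> 8) & 0xF        # bits 4-7
    bit8 = (word >> 7) & 0x1      # bit 8 selects register storage
    mid = (word >> 4) & 0x7       # bits 9-11
    low = word & 0xF              # bits 12-15
    if bit8 == 1:
        # Assumption: the signed 3-bit value uses two's complement.
        delta = mid - 8 if mid & 0x4 else mid
        return {"format": "RS", "op": op, "ra": ra, "del": delta, "rb": low}
    if mid == 0:
        # Bits 8-11 all zero: register to register, a special case of DA.
        return {"format": "RR", "op": op, "ra": ra, "rb": low}
    return {"format": "DA", "op": op, "ra": ra, "dir_addr": word & 0x7F}
```

A word such as 0x1203 decodes as RR with RA = 2 and RB = 3, while 0x1235 has non-zero bits 9-11 and therefore decodes as DA with a direct address of 0x35.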

RI Format



The Register-Immediate (RI) format provides one general register address and 16 bits of immediate
data. The RI format is 32 bits in length as shown:

    +------+------+---+-------+---------+------------------+
    |  OP  |  RA  | 1 |  DEL  | 0 0 0 0 |  IMMEDIATE DATA  |
    +------+------+---+-------+---------+------------------+
     0    3 4    7  8  9    11 12     15 16              31

In addition to an Operation Code field, the RI format contains:

Bits 4-7     Register Address 1 - The RA field is used to specify which of the 16 general registers is
             to be used as an operand and/or destination.
Bit 8        One - This bit being one defines the operation to be a register storage operation.
Bits 9-11    Register Data - These bits are considered a signed value which is used to modify the
             contents of the program counter. Normally, this field would have a value of one for the
             register immediate format.
Bits 12-15   Zeroes - The field being zero is used to specify that the updated program counter, which
             points to the immediate data field, is to be used as a storage address for an operand.
Bits 16-31   Immediate Data - This field serves as a 16 bit immediate data operand for Register
             Immediate instructions.

SS Format

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The Storage to Storage (SS) format provides two storage addresses, one explicit: and (he second 
Implicit The Implied storage address Is contained in General Register 1. Register 1 is modified during 
execution of the instruction- There are two forma of a SS instruction, a direct address form and a storage 
address fcrni. 





Direct Address Form:

  OP   |  OPX  | 0 |  DIR ADDR
 0-3   |  4-7  | 8 |   9-15

Storage Address Form:

  OP   |  OPX  | 1 |  DEL  |  RB
 0-3   |  4-7  | 8 | 9-11  | 12-15

In addition to an Operation Code field, the SS format contains:

Bits 4-7 Operation Extension Code - The OPX field, together with the Operation Code, defines the operation to be performed. Bits 4-5 define the operation type such as ADD or SUBTRACT. Bits 6-7 control the carry, overflow, and how the condition code will be set. Bit 6 = 0 ignores overflow; bit 6 = 1 allows overflow. Bit 7 = 0 ignores the carry stat during the operation; bit 7 = 1 includes the carry stat during the operation.

Bit 8 Zero - Defines the form to be a direct address form. One - Defines the form to be a storage address form.

Bits 9-15 Direct Address (Direct Address Form) - The Direct Storage Address field is used as an address into the level unique storage block or the common storage block. Bits 9-11 of the direct address field must be non-zero to define the direct address form.

Bits 9-11 Register Delta (Storage Address Form) - These bits are considered a signed value which is used to modify the contents of the register specified by the RB field.

Bits 12-15 Register Address 2 (Storage Address Form) - The RB field is used to specify which of the 16 general registers is to be used as a storage address for an operand.
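The two SS forms and the OPX control bits described above can be sketched as a small decoder. This is only an illustration under stated assumptions: bit 0 is the most significant bit, the overflow control is taken to be bit 6 and the carry control bit 7, and the signed delta is assumed to be two's complement. All function and key names are hypothetical.

```python
def field(word: int, first: int, last: int, width: int = 16) -> int:
    """Return bits first..last (inclusive); bit 0 is taken as the most significant bit (assumption)."""
    length = last - first + 1
    shift = width - 1 - last
    return (word >> shift) & ((1 << length) - 1)


def sign_extend(value: int, bits: int) -> int:
    """Interpret an unsigned field as a two's complement number (encoding is an assumption)."""
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def decode_ss(word: int) -> dict:
    """Decode an SS instruction into its OPX controls and its explicit storage address."""
    decoded = {
        "op": field(word, 0, 3),
        "op_type": field(word, 4, 5),               # e.g. ADD or SUBTRACT
        "allow_overflow": bool(field(word, 6, 6)),  # bit 6 = 1 allows overflow
        "include_carry": bool(field(word, 7, 7)),   # bit 7 = 1 includes the carry
    }
    if field(word, 8, 8) == 0:
        if field(word, 9, 11) == 0:
            raise ValueError("bits 9-11 must be non-zero in the direct address form")
        decoded["form"] = "direct"
        decoded["addr"] = field(word, 9, 15)
    else:
        decoded["form"] = "storage"
        decoded["delta"] = sign_extend(field(word, 9, 11), 3)
        decoded["rb"] = field(word, 12, 15)
    return decoded
```

The second storage address is implicit (General Register 1), so it does not appear in the decoded word.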



SPC Format 1

The Special (SPC1) format provides one general register storage operand address.

RR Form:

  OP   |  OPX  | 0 |  LEN  |  RB
 0-3   |  4-7  | 8 | 9-11  | 12-15

RS Form:

  OP   |  OPX  | 1 |  LEN  |  RB
 0-3   |  4-7  | 8 | 9-11  | 12-15

In addition to an Operation Code field, the SPC1 format contains:

Bits 4-7 OP Extension - The OPX field is used to extend the operation code.
Bit 8 Zero or One - This bit being zero defines the operation to be a register operation. This bit being one defines the operation to be a register storage operation.
Bits 9-11 Operation Length - These bits are considered an unsigned value which is used to specify the length of the operand in 16 bit words. A value of zero corresponds to a length of one, and a value of B'111' corresponds to a length of eight.
Bits 12-15 Register Address 2 - The RB field is used to specify which of the 16 general registers is to be used as a storage address for the operand.

SPC Format 2

The Special (SPC2) format provides one general register storage operand address.






  OP   |  RA   |  OPX  |  RB
 0-3   |  4-7  | 8-11  | 12-15



In addition to an Operation Code field, the SPC2 format contains:

Bits 4-7 Register Address 1 - The RA field is used to specify which of the 16 general registers is to be used as an operand and/or destination.
Bits 8-11 OP Extension - The OPX field is used to extend the operation code.
Bits 12-15 Register Address 2 - The RB field is used to specify which of the 16 general registers is to be used as a storage address for the operand.
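The two Special layouts can be sketched side by side, including the SPC1 rule that a length field of zero means one 16 bit word and B'111' means eight. As before, the sketch assumes bit 0 is the most significant bit; `field`, `decode_spc1` and `decode_spc2` are hypothetical names used only for illustration.

```python
def field(word: int, first: int, last: int, width: int = 16) -> int:
    """Return bits first..last (inclusive); bit 0 is taken as the most significant bit (assumption)."""
    length = last - first + 1
    shift = width - 1 - last
    return (word >> shift) & ((1 << length) - 1)


def decode_spc1(word: int) -> dict:
    """SPC1: OP, OPX, register/storage selector, operand length in words, RB."""
    return {
        "op": field(word, 0, 3),
        "opx": field(word, 4, 7),
        "register_storage": field(word, 8, 8) == 1,
        "length_words": field(word, 9, 11) + 1,  # 0 encodes one word, 7 encodes eight
        "rb": field(word, 12, 15),
    }


def decode_spc2(word: int) -> dict:
    """SPC2: OP, RA, a 4-bit OPX in bits 8-11, and RB."""
    return {
        "op": field(word, 0, 3),
        "ra": field(word, 4, 7),
        "opx": field(word, 8, 11),
        "rb": field(word, 12, 15),
    }
```

Storing the length minus one lets a 3-bit field cover operands of one to eight words, which is why no encoding is spent on a zero-length operand.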

THE INSTRUCTION LIST OF THE ISA INCLUDES THE FOLLOWING: 



Table 1. Fixed-Point Arithmetic Instructions

NAME                                      MNEMONIC   TYPE

ADD DIRECT                                ada        DA
ADD FROM STORAGE                          a          RS
  (WITH DELTA)                            awd        RS
ADD IMMEDIATE                             ai         RI
  (WITH DELTA)                            aiwd       RI
ADD REGISTER                              ar         RR
COMPARE DIRECT ADDRESS                    cda        DA
COMPARE IMMEDIATE                         ci         RI
  (WITH DELTA)                            ciwd       RI
COMPARE FROM STORAGE                      c          RS
  (WITH DELTA)                            cwd        RS
COMPARE REGISTER                          cr         RR
COPY                                      cpy        RS
  (WITH DELTA)                            cpywd      RS
COPY WITH BOTH IMMEDIATE                  cpybi      RI
  (WITH DELTA)                            cpybiwd    RI
COPY IMMEDIATE                            cpyi       RI
  (WITH DELTA)                            cpyiwd     RI
COPY DIRECT                               cpyda      DA
COPY DIRECT IMMEDIATE                     cpydai     DA
INCREMENT                                 inc        RS
  (WITH DELTA)                            incwd      RS
LOAD DIRECT                               lda        DA
LOAD FROM STORAGE                         l          RS
  (WITH DELTA)                            lwd        RS
LOAD IMMEDIATE                            li         RI
  (WITH DELTA)                            liwd       RI
LOAD REGISTER                             lr         RR
MULTIPLY SIGNED                           mpy        SPC
MULTIPLY SIGNED EXTENDED                  mpyx       SPC
MULTIPLY SIGNED EXTENDED IMMEDIATE        mpyxi      SPC
MULTIPLY SIGNED IMMEDIATE                 mpyi       SPC
MULTIPLY UNSIGNED                         mpyu       SPC
MULTIPLY UNSIGNED EXTENDED                mpyux      SPC
MULTIPLY UNSIGNED EXTENDED IMMEDIATE      mpyuxi     SPC
MULTIPLY UNSIGNED IMMEDIATE               mpyui      SPC
STORE DIRECT                              stda       DA
STORE                                     st         RS
  (WITH DELTA)                            stwd       RS
STORE IMMEDIATE                           sti        RI
  (WITH DELTA)                            stiwd      RI
SUBTRACT DIRECT                           sda        DA
SUBTRACT FROM STORAGE                     s          RS
  (WITH DELTA)                            swd        RS
SUBTRACT IMMEDIATE                        si         RI
  (WITH DELTA)                            siwd       RI
SUBTRACT REGISTER                         sr         RR
SWAP AND EXCLUSIVE OR WITH STORAGE        swapx      RR

Table 2. Storage to Storage Instructions

NAME                                                MNEMONIC   TYPE

ADD STORAGE TO STORAGE                              sa         SS
  (WITH DELTA)                                      sawd       SS
ADD STORAGE TO STORAGE DIRECT                       sada       SS
ADD STORAGE TO STORAGE FINAL                        saf        SS
  (WITH DELTA)                                      safwd      SS
ADD STORAGE TO STORAGE FINAL DIRECT                 safda      SS
ADD STORAGE TO STORAGE INTERMEDIATE                 sai        SS
  (WITH DELTA)                                      saiwd      SS
ADD STORAGE TO STORAGE INTERMEDIATE DIRECT          saida      SS
ADD STORAGE TO STORAGE LOGICAL                      sal        SS
  (WITH DELTA)                                      salwd      SS
ADD STORAGE TO STORAGE LOGICAL DIRECT               salda      SS
COMPARE STORAGE TO STORAGE                          sc         SS
  (WITH DELTA)                                      scwd       SS
COMPARE STORAGE TO STORAGE DIRECT                   scda       SS
COMPARE STORAGE TO STORAGE FINAL                    scf        SS
  (WITH DELTA)                                      scfwd      SS
COMPARE STORAGE TO STORAGE FINAL DIRECT             scfda      SS
COMPARE STORAGE TO STORAGE INTERMEDIATE             sci        SS
  (WITH DELTA)                                      sciwd      SS
COMPARE STORAGE TO STORAGE INTERMEDIATE DIRECT      scida      SS
COMPARE STORAGE TO STORAGE LOGICAL                  scl        SS
  (WITH DELTA)                                      sclwd      SS
COMPARE STORAGE TO STORAGE LOGICAL DIRECT           sclda      SS
MOVE STORAGE TO STORAGE                             smov       SS
  (WITH DELTA)                                      smovwd     SS
MOVE STORAGE TO STORAGE DIRECT                      smovda     SS
SUBTRACT STORAGE TO STORAGE                         ss         SS
  (WITH DELTA)                                      sswd       SS
SUBTRACT STORAGE TO STORAGE DIRECT                  ssda       SS
SUBTRACT STORAGE TO STORAGE FINAL                   ssf        SS
  (WITH DELTA)                                      ssfwd      SS
SUBTRACT STORAGE TO STORAGE FINAL DIRECT            ssfda      SS
SUBTRACT STORAGE TO STORAGE INTERMEDIATE            ssi        SS
  (WITH DELTA)                                      ssiwd      SS
SUBTRACT STORAGE TO STORAGE INTERMEDIATE DIRECT     ssida      SS
SUBTRACT STORAGE TO STORAGE LOGICAL                 ssl        SS
  (WITH DELTA)                                      sslwd      SS
SUBTRACT STORAGE TO STORAGE LOGICAL DIRECT          sslda      SS



Table 3. Logical Instructions

NAME                                            MNEMONIC   TYPE

AND DIRECT ADDRESS                              nda        DA
AND FROM STORAGE                                n          RS
  (WITH DELTA)                                  nwd        RS
AND IMMEDIATE                                   ni         RI
  (WITH DELTA)                                  niwd       RI
AND REGISTER                                    nr         RR
OR DIRECT ADDRESS                               oda        DA
OR FROM STORAGE                                 o          RS
  (WITH DELTA)                                  owd        RS
OR IMMEDIATE                                    oi         RI
  (WITH DELTA)                                  oiwd       RI
OR REGISTER                                     or         RR
XOR DIRECT ADDRESS                              xda        DA
XOR FROM STORAGE                                x          RS
  (WITH DELTA)                                  xwd        RS
XOR IMMEDIATE                                   xi         RI
  (WITH DELTA)                                  xiwd       RI
XOR REGISTER                                    xr         RR

Table 4. Shift Instructions

NAME                                            MNEMONIC   TYPE

SCALE BINARY                                    scale      SPC
SCALE BINARY IMMEDIATE                          scalei     SPC
SCALE BINARY REGISTER                           scaler     SPC
SCALE HEXADECIMAL                               scaleh     SPC
SCALE HEXADECIMAL IMMEDIATE                     scalehi    SPC
SCALE HEXADECIMAL REGISTER                      scalehr    SPC
SHIFT LEFT ARITHMETIC BINARY                    sla        SPC
SHIFT LEFT ARITHMETIC BINARY IMMEDIATE          slai       SPC
SHIFT LEFT ARITHMETIC BINARY REGISTER           slar       SPC
SHIFT LEFT ARITHMETIC HEXADECIMAL               slah       SPC
SHIFT LEFT ARITHMETIC HEXADECIMAL IMMEDIATE     slahi      SPC
SHIFT LEFT ARITHMETIC HEXADECIMAL REGISTER      slahr      SPC
SHIFT LEFT LOGICAL BINARY                       sll        SPC
SHIFT LEFT LOGICAL BINARY IMMEDIATE             slli       SPC
SHIFT LEFT LOGICAL BINARY REGISTER              sllr       SPC
SHIFT LEFT LOGICAL HEXADECIMAL                  sllh       SPC
SHIFT LEFT LOGICAL HEXADECIMAL IMMEDIATE        sllhi      SPC
SHIFT LEFT LOGICAL HEXADECIMAL REGISTER         sllhr      SPC
SHIFT RIGHT ARITHMETIC BINARY                   sra        SPC
SHIFT RIGHT ARITHMETIC BINARY IMMEDIATE         srai       SPC
SHIFT RIGHT ARITHMETIC BINARY REGISTER          srar       SPC
SHIFT RIGHT ARITHMETIC HEXADECIMAL              srah       SPC
SHIFT RIGHT ARITHMETIC HEXADECIMAL IMMEDIATE    srahi      SPC
SHIFT RIGHT ARITHMETIC HEXADECIMAL REGISTER     srahr      SPC
SHIFT RIGHT LOGICAL BINARY                      srl        SPC
SHIFT RIGHT LOGICAL BINARY IMMEDIATE            srli       SPC
SHIFT RIGHT LOGICAL BINARY REGISTER             srlr       SPC
SHIFT RIGHT LOGICAL HEXADECIMAL                 srlh       SPC
SHIFT RIGHT LOGICAL HEXADECIMAL IMMEDIATE       srlhi      SPC
SHIFT RIGHT LOGICAL HEXADECIMAL REGISTER        srlhr      SPC



Table 5. Branch Instructions

NAME                                MNEMONIC   TYPE

BRANCH                              b          RS
  (WITH DELTA)                      bwd        RS
BRANCH DIRECT                       bda        DA
BRANCH IMMEDIATE                    bi         RI
  (WITH DELTA)                      biwd       RI
BRANCH REGISTER                     br         RS
BRANCH AND LINK                     bal        RS
BRANCH AND LINK DIRECT              balda      DA
BRANCH AND LINK IMMEDIATE           bali       RI
  (WITH DELTA)                      baliwd     RI
BRANCH AND LINK REGISTER            balr       RS
BRANCH BACKWARD                     bb         RS
  (WITH DELTA)                      bbwd       RS
BRANCH BACKWARD DIRECT              bbda       DA
BRANCH BACKWARD IMMEDIATE           bbi        RI
  (WITH DELTA)                      bbiwd      RI
BRANCH BACKWARD REGISTER            bbr        RS
BRANCH FORWARD                      bf         RS
  (WITH DELTA)                      bfwd       RS
BRANCH FORWARD DIRECT               bfda       DA
BRANCH FORWARD IMMEDIATE            bfi        RI
  (WITH DELTA)                      bfiwd      RI
BRANCH FORWARD REGISTER             bfr        RS
BRANCH ON CONDITION                 bc         RS
  (WITH DELTA)                      bcwd       RS
BRANCH ON CONDITION DIRECT          bcda       RS
BRANCH ON CONDITION IMMEDIATE       bci        RI
  (WITH DELTA)                      bciwd      RI
BRANCH ON CONDITION REGISTER        bcr        RS
BRANCH RELATIVE                     brel       RI
  (WITH DELTA)                      brelwd     RS
NULL OPERATION                      noop       RR



Table 6. Status Switching Instructions

NAME                                MNEMONIC   TYPE

RETURN                              ret        SPC



Table 7. Input/Output Instructions

NAME                                MNEMONIC   TYPE

IN                                  IN         SPC
OUT                                 OUT        SPC
INTERNAL DIOR/DIOW                  INTR       SPC
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SOME SUMMARY FEATURES 

The APAP Machine in Perspective

The machine we have described in accordance with our invention could be thought of in its more detailed aspects to be positioned in the technology somewhere between the CM-1 and N-cube. Like our APAP, the CM-1 uses a point design for the processing element and combines processing elements with memory on the basic chip. The CM-1, however, uses a 1 bit wide serial processor, while the APAP series will use a 16 bit wide processor. The CM series of machines started with 4K bits of memory per processor and has grown to 8 or 16K bits, versus the 32K by 16 bits we have provided for the first APAP chip. The CM-1 and its follow-ons are strictly SIMD machines while the CM-5 is a hybrid. Instead of this, our APAP will effectively use MIMD operating modes in conjunction with SIMD modes when useful. While our parallel 16 bit wide PMEs might be viewed as a step toward the N-cube, this step is not warranted. The APAP does not separate memory and routing from the processing element as does the N-cube kind of machine. Also, the APAP provides for up to 32K 16 bit PMEs while the N-cube only provides for 4K 32 bit processors.

Even with the superficial similarities presented above, the APAP concept completely differs from the CM and N-cube series by:

1. The modified hypercube incorporated in our APAP is a new invention providing a significant packaging and addressing advantage when compared with hypercube topologies. For instance, consider that the 32K PME APAP in its first preferred embodiment has a network diameter of 19 logical steps and, with transparency, this can be reduced to an effective 16 logical steps. Further, by comparison, if a pure hypercube were used, and if all PMEs were sending data through an 8 step path, then on average 2 of every 8 PMEs would be active while the remainder would be delayed due to blockage.

Alternatively, consider the 64K hypercube that would be needed if CM-1 was a pure hypercube. In that case, each PME would require ports to 16 other PMEs, and data could be routed between the two farthest separated PMEs in 16 logical steps. If all PMEs tried to transfer an average distance of 7 steps, then 2 of every 7 would be active. However, CM-1 does not utilize a 16d hypercube. It interconnects the 16 nodes on a chip with a NEWS network; then it provides one router function within the chip. The 4096 routers are connected into a 12d hypercube. With no collisions the hybrid still has a logical diameter of 15, but since 16 PMEs could be contending for the link, its effective diameter is much greater. That is, with 8 step moves only 2 of 16 PMEs could be active, which means that 8 complete cycles rather than 4 cycles are needed to complete all data moves.

The N-cube actually utilizes a pure hypercube, but currently only provides for 4096 PMEs and thus utilizes a 12d (13d for 8192 PMEs) hypercube. For the N-cube to grow to 16K processors, at which point it would have the same processing data width as the APAP, it would have to add four times as much hardware and would have to increase the connection ports to each PME router by 25%. Although no hard data exists to support this conclusion, it would appear that the N-cube architecture runs out of connector pins prior to reaching a 16K PME machine.

2. The completely integrated and distributed nature of major tasks within the APAP machine is a decided advantage. As was noted for the CM and N-cube series of machines, each had to have separate units for message routing as well as separate units for floating point coprocessors. The APAP system combines the integer, floating point processing, message routing and I/O control into the single point design PME. That design is then replicated 8 times on a chip, and the chip is then replicated 4K times to produce the array. This provides several advantages:

a. Using one chip means maximum size production runs and minimal system factor costs.
b. Regular architecture produces the most effective programming systems.
c. Almost all chip pins can be dedicated to the generic problem of interprocessor communication, maximizing the inter-chip I/O bandwidth which tends to be an important limiting factor in MPP designs.

3. The APAP has the unique design ability to take advantage of chip technology gains and capital investment in custom chip designs.

Consider the question of floating point performance. It is anticipated that APAP PME performance on DAXPY will be about 125 cycles per flop. In contrast, the '387 coprocessor would be about 14 cycles while the Weitek coprocessor in the CM-1 would be about 6 cycles. However, in the CM case there is only one floating point unit for every 16 PMEs, while in the N-cube case there is probably one '387 type chip associated with each of the '386 processors. Our APAP has 16 times as many PMEs and therefore can almost completely make up for the single unit performance delta.

More significantly, the 8 APAP PMEs within a chip are constructed from 50K gates currently available in the technology. As memory macros shrink, the number of gates available to the logic increases. Spending that increase on enhanced floating point normalization should permit APAP floating point performance to far exceed the other units. Alternatively, effort could be spent to generate a PME or PME subsection design using custom design approaches, enhancing total performance while in no way affecting any S/W developed for the machine.

We believe our design for our APAP has characteristics poised to take advantage of the future process technology growth. In contrast, the nearest similar machines, CM-x and N-cube, which employ a system like that described in FIGURE 1, seem well poised to take advantage of yesterday's technology, which we feel is dead ended.
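Some of the sizing figures quoted in the comparison above can be checked with a short sketch: the dimension of a pure hypercube for a given node count, the PME count obtained from 8 PMEs per chip and 4K chips, and the point at which an N-cube of 32 bit processors matches the aggregate data width of 32K 16 bit PMEs. The function and constant names are hypothetical, and the sketch models nothing about blockage or pin counts.

```python
def hypercube_dimension(nodes: int) -> int:
    """Dimension d of a pure hypercube with 2**d nodes (also the ports needed per node)."""
    if nodes <= 0 or nodes & (nodes - 1):
        raise ValueError("a pure hypercube needs a power-of-two node count")
    return nodes.bit_length() - 1


def aggregate_width_bits(processors: int, bits_each: int) -> int:
    """Total data width across all processors, in bits."""
    return processors * bits_each


K = 1024
PMES_PER_CHIP = 8
CHIPS = 4 * K
APAP_PMES = PMES_PER_CHIP * CHIPS  # 8 PMEs per chip replicated 4K times

if __name__ == "__main__":
    for nodes in (4 * K, 8 * K, 16 * K, 64 * K):
        print(nodes, "nodes ->", hypercube_dimension(nodes), "d hypercube")
    print("APAP PMEs:", APAP_PMES)
    print("APAP width:", aggregate_width_bits(APAP_PMES, 16))
    print("16K x 32-bit width:", aggregate_width_bits(16 * K, 32))
```

The last two printed values are equal, which is the sense in which a 16K processor N-cube would have the same processing data width as the 32K PME APAP.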
An advantage of the APAP concept is the ability to use DASD associated with groups of PMEs. This APAP capability, as well as the ability to connect displays and auxiliary storage, is a by-product of picking MC bus structures as the interface to the external I/O ports of the PME array. Thus, APAP systems will be configurable and can include card mounted hard drives selected from one of the set of units that are compatible with PS/2 or RISC/6000 units. Further, that capability should be available without designing any additional part number modules, although it does require utilizing more replications of the backpanel and base enclosure than does the APAP.

This brief perspective is not intended to be limiting, but rather is intended to cause those skilled in the art to review the foregoing description and examine how the many inventions we have described may be used to move the art of massively parallel systems ahead to a time when programming is no longer a significant problem and the costs of such systems are much lower. Our kind of system can be made available, not only to the few, but to many, as it could be made at a cost within the reach of commercial department level procurements.

While we have described our preferred embodiments of our invention, it will be understood that those skilled in the art, both now and in the future, upon the understanding of these discussions will make various improvements and enhancements thereto which fall within the scope of the claims which follow. These claims should be construed to maintain the proper protection for the invention first disclosed.

Claims

1. A computer system comprising a plurality of multi-processor memory elements, each having communication paths, processor and memory, and wherein a programmable router is provided for routing data and control information from one multi-processor memory element to another multi-processor memory element and between nodes of the computer system.

2. A computer system according to claim 1 wherein each multi-processor memory element (PME) has 2n processors, and communication paths which minimize delays due to chip crossings.

3. A computer system according to claim 1 wherein each multi-processor memory element (PME) has a processor, memory and routers within a single chip and internal and external communication paths which minimize delays due to chip crossings, each processor memory element having means for fixed and floating point processing, routing and I/O control.

4. A computer system according to claim 1 further comprising within a processor memory element a native instruction set means for providing an expandable multiply function, a programmable router for routing information alternatively left/right, NEWS matrix, NEWS/up-down, hypercube, and wherein said programmable router employs a hardwired distributed router provided by each processing memory element.

5. A computer system according to claim 1 organized as a massively parallel machine with nodes interconnected as an n dimensional network cluster with parallel communication paths between processor memory elements along said internal and external communication paths providing a processing array, and wherein processing memory elements of an array have a transparent mode utilized when routing data between processing memory elements within a chip set of processing memory elements for permitting reduction of the effective network diameter of a network of nodes.

6. A computer system according to claim 1 wherein a node of a processor array has multiple single processor elements made up of 32K 16-bit words with a 16-bit processor for a network node of eight processors with their associated memory with their fully distributed I/O routers and signal I/O ports.

7. A computer system according to claim 1 wherein a node of a processor array has multiple single processor elements made up of 32K 16-bit words with a 16-bit processor for a network node of eight processors with their associated memory with their fully distributed I/O routers and signal I/O ports combined as groups of node clusters organized as a 2d modified hypercube.

8. A computer system according to claim 1 wherein a node of a processor array has multiple single processor elements made up of 32K 16-bit words with a 16-bit processor for a network node of eight processors with their associated memory with their fully distributed I/O routers and signal I/O ports combined as groups of node clusters organized as a 2d modified hypercube, with up to 64 clusters integrated in a network of node clusters to form a 4d modified hypercube of up to 32,768 processing memory elements.

9. A computer system according to claim 1 wherein a node processing memory element has internal data flows using high speed hard registers to feed distributed ALU and I/O router registers and logic for all operations.

10. A computer system according to claim 1 wherein a node processing memory element has an I/O port for off chip byte wide communication, and has input ports that are connected such that data may be routed from input to memory, or from an input address register to an output register via a direct parallel data path.

11. A computer system according to claim 1 wherein a node has multiple processor memory elements and is connected to other nodes in a cluster network with data routing distributed between hardware and software, with software controlling most of the task sequencing function.

12. A computer system according to claim 1 wherein a node has multiple processor memory elements and is connected to other nodes in a cluster network with data routing distributed between hardware and software, with hardware provided for performing inner loop transfers and minimizing overhead on the outer loops of the node.

13. A computer system according to claim 1 wherein a node has multiple processor memory elements and is connected to other nodes in a cluster network with I/O programs at dedicated interrupt levels for managing the network.

14. A computer system according to claim 1 wherein a node has multiple processor memory elements and is connected to other nodes in a cluster network with I/O programs at dedicated interrupt levels for managing the network, each processor memory element having interrupt registers and dedicating four interrupt levels to receiving data from four neighbors, a buffer provided at each level by loading registers at the level, and having IN and RETURN instruction pairs using a buffer address and transfer count to enable the processor memory element to accept words from an input bus and to store them to the buffer.



15. A multi-processor memory system, comprising: a plurality of multi-processor memory elements, each multi-processor memory element (PME) having 2n processors, memory and routers within a single chip and internal and external communication paths which minimize delays due to chip crossings.
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